Abstract

I define a natural measure of the complexity of a parametric distribution
relative to a given true distribution called the razor of a model family. The Min- 
imum Description Length principle (MDL) and Bayesian inference are shown 
to give empirical approximations of the razor via an analysis that significantly 
extends existing results on the asymptotics of Bayesian model selection. I treat 
parametric families as manifolds embedded in the space of distributions and 
derive a canonical metric and a measure on the parameter manifold by ap- 
pealing to the classical theory of hypothesis testing. I find that the Fisher 
information is the natural measure of distance, and give a novel justification 
for a choice of Jeffreys prior for Bayesian inference. The results of this paper 
suggest corrections to MDL that can be important for model selection with a 
small amount of data. These corrections are interpreted as natural measures of 
the simplicity of a model family. I show that in a certain sense the logarithm of 
the Bayesian posterior converges to the logarithm of the razor of a model family 
as defined here. Close connections with known results on density estimation 
and "information geometry" are discussed as they arise. 



1 Introduction 

William of Ockham, a great lover of simple explanations, wrote that "a plurality is 
never to be posited except where necessary." |T^] The aim of this paper is to provide a 
geometric insight into this principle of economy of thought in the context of inference 
of parametric distributions. The task of inferring parametric models is often divided 
into two parts. First of all, a parametric family must be chosen and then parame- 
ters must be estimated from the available data. Once a model family is specified, 
the problem of parameter estimation, although hard, is well understood - the typical
difficulties involve the presence of misleading local minima in the error surfaces associated
with different inference procedures. However, less is known about the task of
picking a model family, and practitioners generally employ a judicious combination 
of folklore, intuition, and prior knowledge to arrive at suitable models. 

The most important principled techniques that are used for model selection are 
Bayesian inference and the Minimum Description Length principle. In this paper I 
will provide a geometric insight into both of these methods and I will show how they 
are related to each other. In Section 2 I give a qualitative discussion of the meaning
of "simplicity" in the context of model inference and discuss why schemes that favour
simple models are desirable. In Section 3 I will analyze the typical behaviour of Bayes
rule to construct a quantity that will turn out to be a razor or an index of the simplicity
and accuracy of a parametric distribution as a model of a given true distribution. In
effect, the razor will be shown to be an ideal measure of "distance" between a
model family and a true distribution in the context of parsimonious model selection.
In order to define this index it is necessary to have a notion of measure and of
metric on a parameter manifold viewed as a subspace of the space of probability
distributions. Section 4 is devoted to a derivation of a canonical metric and measure
on a parameter manifold. I show that the natural distance on a parameter manifold 
in the context of model inference is the Fisher Information. The resulting integration 
measure on the parameters is equivalent to a choice of Jeffreys prior in a Bayesian 
interpretation of model selection. The derivation of Jeffreys prior in this paper makes 
no reference to the Minimum Description Length principle or to coding arguments 
and arises entirely from geometric considerations. In a certain novel sense Jeffreys 
prior is seen to be the prior on a parameter manifold that is induced by a uniform 
prior on the space of distributions. Some relationships with the work of Amari et al.
in information geometry are described. ([0], 0) In Section 5 the behaviour of the
razor is analyzed to show that empirical approximations to this quantity will enable
parsimonious inference schemes. I show in Section 6 that Bayesian inference and the
Minimum Description Length principle are empirical approximations of the razor. 
The analysis of this section also reveals corrections to MDL that become relevant when 
comparing models given a small amount of data. These corrections have the pleasing 
interpretation of being measures of the robustness of the model. Examination of the 
behaviour of the razor also points the way towards certain geometric refinements to 
the information asymptotics of Bayes Rule derived by Clarke and Barron. Close 
connections with the index of resolvability introduced by Barron and Cover are also 
discussed. (0) 

2 What is Simplicity? 

Since the goal of this paper is to derive a geometric notion of simplicity of a model 
family it is useful to begin by asking why we would wish to bias our inference pro- 
cedures towards simple models. We should also ask what the qualitative meaning 
of "simplicity" should be in the context of inference of parametric distributions so 
that we can see whether the precise results arrived at later are in accord with our
intuitions. For concreteness let us suppose that we are given a set of outcomes
$E = \{e_1, \ldots, e_N\}$ generated i.i.d. from a true distribution $t$. In some suitable sense, the
empirical distribution of these events will fall with high probability within some ball
around $t$ in the space of distributions. (See Figure 1.) Now let us suppose that we are
trying to model $t$ with one of two parametric families $M_1$ or $M_2$. Now $M_1$ and $M_2$ define
manifolds embedded in the space of distributions (see Figure 1) and the inference
task is to pick the distribution on $M_1$ or $M_2$ that best describes the true distribution.

Figure 1: Parameter Manifolds in The Space of Distributions

If we had an infinite number of outcomes and an arbitrary amount of time with 
which to perform the inference, the question of simplicity would not arise. Indeed, 
we would simply use a consistent parameter estimation procedure to pick the model 
distribution on $M_1$ or $M_2$ that gives the best description of the empirical data and
that would be guaranteed to give the best model of the true distribution. However,
since we only have finite computational resources and since the empirical distribution
for finite $N$ only approximates the true distribution, our inference procedure has to be more
careful. Indeed, we are naturally led to prefer models with fewer degrees of freedom. 
First of all, smaller models will require less computational time to manipulate. They 
will also be easier to optimize since they will generically have fewer misleading local 
minima in the error surfaces associated with the estimation. Finally, a model with 
fewer degrees of freedom generically will be less able to fit statistical artifacts in small 
data sets and will therefore be less prone to so-called "generalization error". Another,
more subtle, preference regarding models inferred from finite data sets has to do with 
the "naturalness" of the model. Suppose we are using a family M to describe a set of 
$N$ outcomes drawn from $t$. If the accuracy of the description depends very sensitively
on the precise choice of parameters then it is likely that the true distribution will be
poorly modelled by $M$. (See Figure 2.) This is for two reasons: 1) the optimal choice
of parameters will be hard to find if the model is too sensitive to the choice, and
2) even if we succeed in getting a good description of one set of sample outcomes,
the parameter sensitivity suggests that another sample will be poorly described. In
geometric terms, we would prefer model families which describe a set of distributions
all of which are close to the true distribution. (See Figure 2.) In a sense this property would
make a family a more "natural" model of the true distribution $t$ than another which
approaches $t$ very closely at an isolated point.

Figure 2: Natural and Unnatural Models

The discussion above suggests that for practical reasons inference schemes oper- 
ating with a finite number of sample outcomes should prefer models that give good 
descriptions of the empirical data, have fewer degrees of freedom and are "natural" 
in the sense discussed above. I will refer to the first property (good description) as 
accuracy and the latter two (fewer degrees of freedom and naturalness) as simplicity. 
We will see that both accuracy and simplicity of parametric models can be under- 
stood in terms of the geometry of the model manifold in the space of distributions. 
This geometric understanding provides an interesting complement to the minimum 
description length approach, which gives an implicit definition of simplicity in terms 
of shortest description length of the data and model. 

3 Construction of The Razor Of A Model 

The previous section has discussed the qualitative meaning of simplicity and its prac- 
tical importance for inference of distributions from a finite amount of data. In this 
section we will construct a quantity that is an index of the accuracy and the simplicity 
of a model family as a description of a given true distribution. We will show in later 
sections that empirical approximations of this quantity which we call the razor of 
a model will enable consistent and parsimonious inference of parametric probability
distributions.



3.1 Construction From Bayes Rule 

We will now motivate the definition of the razor via a construction from the Bayesian 
approach to model inference. (In later sections we will conduct a more precise analysis 
of the relationship between the razor and Bayes Rule.) Suppose we are given a 
collection of outcomes $E = \{e_1, \ldots, e_N\}$, $e_i \in X$, drawn independently from a true
density $t$, defined with respect to Lebesgue measure on $X$. Suppose also that we are
given two parametric families of distributions $A$ and $B$ and we wish to pick one of
them as the model family that we will use. The Bayesian approach to this problem
consists of computing the posterior conditional probabilities $\Pr(A|E)$ and $\Pr(B|E)$
and picking the family with the higher probability. The conditional probabilities 
depend, of course, on the specific outcomes, and so in order to understand the most 
likely result of an application of Bayes Rule we should analyze the statistics of the 
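posterior probabilities. Let $A$ be parametrized by a set of parameters $\Theta = \{\theta_1, \ldots, \theta_d\}$.
Then Bayes Rule tells us that:

$$\Pr(A|E) = \frac{\Pr(A)}{\Pr(E)} \int d\mu(\Theta)\, w(\Theta)\, \Pr(E|\Theta) \qquad (1)$$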

In this expression $\Pr(A)$ is the prior probability of the model family, $w(\Theta)$ is a prior
density with respect to Lebesgue measure on the parameter space and $\Pr(E)$ is a
prior density on the outcome sample space. The Lebesgue measure induced by the
parametrization of the $d$ dimensional parameter manifold is denoted $d\mu(\Theta)$. Since we
are interested in comparing $\Pr(A|E)$ with $\Pr(B|E)$, the prior $\Pr(E)$ is a common factor
that we may omit and for lack of any better choice we take the prior probabilities
of $A$ and $B$ to be equal and omit them. In order to analyze the typical behaviour of
Equation 1 observe that $\Pr(E|\Theta) = \prod_{i=1}^{N} \Pr(e_i|\Theta) = \exp\left[\sum_{i=1}^{N} \ln \Pr(e_i|\Theta)\right]$. Define
$G_i(\Theta) = \ln \Pr(e_i|\Theta)$ and $F(\Theta) = \sum_{i=1}^{N} G_i(\Theta)$. We see that $F(\Theta)$ is the sum of $N$ identically
distributed, independent random variables. Consequently, as $N$ grows large
the Central Limit Theorem applies and we can write down the probability distribution
for $F$ as:

$$\Pr(F) = \frac{1}{\sqrt{2\pi N \sigma^2}} \exp\left[-\frac{(F - N\mu)^2}{2 N \sigma^2}\right] \qquad (2)$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the $G_i$ in the true distribution
and are defined as $\mu = \langle G_i(\Theta) \rangle_t = \int dx\, t(x) \ln \Pr(x|\Theta)$ and $\sigma^2 = \langle G_i(\Theta)^2 \rangle_t - \langle G_i(\Theta) \rangle_t^2$.
The most likely value of $F(\Theta)$ is $N\mu$ and $\mu$ can be written in the following
pleasing form:

$$\langle G_i(\Theta) \rangle_t = -\int dx\, t(x) \ln\left(\frac{t(x)}{\Pr(x|\Theta)}\right) + \int dx\, t(x) \ln(t(x)) = -D(t\|\Theta) - h(t) \qquad (3)$$

where $h(t)$, the differential entropy of the true distribution, is assumed finite, and
$D(t\|\Theta)$ is the relative entropy or Kullback-Leibler distance between $t$ and the distribution
indexed by $\Theta$. This suggests that the following quantity is worthy of investigation:

$$R^C_N(A) \propto \int d\mu(\Theta)\, w(\Theta) \exp\left[-N \left(D(t\|\Theta) + h(t)\right)\right] \qquad (4)$$



(The superscript $C$ is intended to indicate that Equation 4 is a candidate razor that we
will improve in the subsequent discussion.) We will see in Section 6.1 that $R^C_N(A)$ is
closely related to the typical asymptotics of $\ln \Pr(E|A)$. In the business of comparing
$\Pr(A|E)$ and $\Pr(B|E)$, $\exp[-N h(t)]$ is a common factor. So we drop it and also
note that in the absence of any prior information the most conservative choice for
$w(\Theta)$ appears to be the uniform prior on the parameter manifold. (We will return
to examine this point critically and we will find that the natural prior is not in fact
uniform in the parameters.) So we finally write our candidate razor as:

$$R^C_N(A) = \frac{\int d\mu(\Theta) \exp\left[-N D(t\|\Theta)\right]}{\int d\mu(\Theta)} \qquad (5)$$

We have assumed a compact parameter manifold so that the uniform distribution on
the surface can be written as one over the volume. We take the integration measure
$d\mu(\Theta)$ to be the Lebesgue measure induced on the manifold by the atlas defined via
the parametrization. These definitions can be extended to non-compact parameter
manifolds with a little bit of care, but we will not do this here. The quantity $R^C_N(A)$
defined in Equation 5 is our candidate for a natural measure of the accuracy as well
as the simplicity of a parametric model distribution. The construction of the razor
in this section is intended to be motivational. We will see in Section 6.1 that the
razor is closely related to the typical asymptotics of the logarithm of $\Pr(A|E)$. Note
that the razor is not a quantity that is estimated from data; it is a theoretical
measure of complexity like the "index of resolvability" introduced by Barron and
Cover and discussed in Section 6. ([|]) We will show that an accurate estimator of the
razor can be used to implement consistent and parsimonious inference of probability
distributions.
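
As a hedged numerical illustration of the candidate razor with a uniform prior, the sketch below assumes a toy setting that is not taken from the paper: the true density $t$ is a unit-variance Gaussian with mean zero, and the model family consists of unit-variance Gaussians whose mean $\theta$ ranges over $[-L, L]$, so that $D(t\|\theta) = \theta^2/2$. All function names are hypothetical.

```python
import math

def kl_unit_gaussians(theta, true_mean=0.0):
    """Relative entropy D(t || theta) between unit-variance Gaussians."""
    return 0.5 * (theta - true_mean) ** 2

def candidate_razor(n_samples, half_width, grid=20001):
    """Midpoint-rule estimate of the candidate razor on [-half_width, half_width].

    Numerator: integral of exp(-N * D(t || theta)) with the naive measure d(theta).
    Denominator: the parameter-space volume 2 * half_width (uniform prior).
    """
    step = 2.0 * half_width / grid
    total = 0.0
    for k in range(grid):
        theta = -half_width + (k + 0.5) * step
        total += math.exp(-n_samples * kl_unit_gaussians(theta)) * step
    return total / (2.0 * half_width)

def candidate_razor_exact(n_samples, half_width):
    """Closed form of the same Gaussian integral, used only as a cross-check."""
    numerator = math.sqrt(2.0 * math.pi / n_samples) * math.erf(
        half_width * math.sqrt(n_samples / 2.0))
    return numerator / (2.0 * half_width)

if __name__ == "__main__":
    for half_width in (1.0, 5.0):
        print(half_width, candidate_razor(10, half_width))
```

In this toy setting the family confined to a smaller neighbourhood of the truth receives the larger value, in line with the qualitative preference for families whose distributions all lie close to $t$.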



3.2 A Difficulty 

There is a major difficulty with an interpretation of $R^C_N(A)$ in Equation 5 as an intrinsic
measure of qualities such as the simplicity of a parametric family. This difficulty
arises because we have not defined the integration measure sufficiently carefully. To
see this in pedestrian terms, consider a family with two parameters, $x$ and $y$, for
which the naive integration measure in the razor would be $d\mu(\Theta) = dx\, dy$. We could
do the integration in polar coordinates ($r$ and $\phi$), in which case the measure would be
$d\mu(\Theta) = r\, dr\, d\phi$. Now the model could have been specified in the first place in terms
of the coordinates $r$ and $\phi$ in which case the naive integration measure would have
been $d\mu(\Theta) = dr\, d\phi$ which will clearly yield a different definition of the razor. In other
words, the razor as defined above is not reparametrization invariant and consequently
measures something about both the model family and its parametrization. In order to
define the razor as an intrinsic measure of the simplicity of a parametric distribution
we need to have an invariant integration measure on the parameter manifold. This
is easily achieved - if we know how to introduce a metric on the surface, the metric
will induce a measure with required properties. But what is the correct metric on 
the parameter manifold? Since the model is embedded in the space of distributions, 
the metric on the manifold should be induced from a natural distance in the space of 
distributions. 
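
The parametrization dependence described above can be seen in a one-parameter toy case. The sketch below is an illustration under assumptions that are not in the paper: the family is a unit-variance Gaussian with mean $\theta \in [0, 1]$, the true mean is 0.2, and the same family is rewritten with the hypothetical coordinate $\phi$ defined by $\theta = \phi^2$. In each coordinate the naive uniform measure is used.

```python
import math

def midpoint_integral(func, lower, upper, grid=20000):
    """Midpoint-rule quadrature of func on [lower, upper]."""
    step = (upper - lower) / grid
    return sum(func(lower + (k + 0.5) * step) for k in range(grid)) * step

def relative_entropy(theta, true_mean=0.2):
    """D(t || theta) for unit-variance Gaussians with means true_mean and theta."""
    return 0.5 * (theta - true_mean) ** 2

def naive_razor_theta(n_samples):
    """Candidate razor with the naive measure d(theta) on theta in [0, 1]."""
    numerator = midpoint_integral(
        lambda th: math.exp(-n_samples * relative_entropy(th)), 0.0, 1.0)
    return numerator / 1.0  # parameter-space volume is 1

def naive_razor_phi(n_samples):
    """Same family with theta = phi**2 and the naive measure d(phi) on [0, 1]."""
    numerator = midpoint_integral(
        lambda ph: math.exp(-n_samples * relative_entropy(ph ** 2)), 0.0, 1.0)
    return numerator / 1.0  # parameter-space volume is 1

if __name__ == "__main__":
    print(naive_razor_theta(20), naive_razor_phi(20))
```

The two functions describe the same set of distributions, yet they return different numbers, which is the difficulty the text identifies.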

Further insight into this issue is obtained by considering the Bayesian construction 
of the razor. In the course of this construction we assumed a uniform prior on the 
parameter manifold. Now the manifold itself is some parameter invariant object that 
lives in the space of distributions. Let A be a set of distributions in the parameter 
manifold. The uniform prior associated with different parametrizations will assign 
different measures to the set A. On the one hand we could say that the choice of 
parametrization of a model involves an implicit choice of measure on the manifold 
and that a parameter dependence is therefore to be expected in Bayesian methods. 
However, it seems more correct to say that we did not actually mean to say that 
all parameters are equally likely - our intention was to say that in the absence of 
any prior information, all distributions are equally likely. In other words, to apply 
Bayesian methods properly to the task of model inference, we have to find a way of 
assigning a uniform prior in the space of distributions and induce from that a measure 
on parameter manifolds. 

In the next section we will use the observations made in the previous paragraphs 
to derive a metric and a measure on the parameter manifold that make the razor 
a parameter-invariant measure of the simplicity of a model. We will find that the 
natural metric on the parameter manifold is the Fisher Information on the surface 
and the reparametrization invariant razor is consequently given by: 

$$R_N(A) = \frac{\int d\mu(\Theta) \sqrt{\det J}\, \exp\left[-N D(t\|\Theta)\right]}{\int d\mu(\Theta) \sqrt{\det J}} \qquad (6)$$

where J is the Fisher Information matrix. The work of Rao, Fisher, Amari and 
others has previously suggested this choice of measure and metric. 0). However, 
as pointed out by these authors, there are many potential choices of metrics on 
parameter manifolds and the choice of a metric requires careful justification. Once a 
metric is chosen the standard apparatus of differential geometry may be unfolded and 
statistical interpretations can be attached to geometric quantities. In the following
sections I provide justifications for the choice of the Fisher Information as the metric 
appropriate to model estimation. 
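
A small check of the invariance claimed for the Fisher-weighted razor is sketched below, under assumptions that are not in the paper: a unit-variance Gaussian family with mean $\theta \in [0, 1]$ (Fisher Information equal to 1 in the coordinate $\theta$), a true mean of 0.2, and a hypothetical second coordinate $\phi$ with $\theta = \phi^2$, in which the Fisher Information becomes $(d\theta/d\phi)^2 = 4\phi^2$. Function names are illustrative only.

```python
import math

def midpoint_integral(func, lower, upper, grid=20000):
    """Midpoint-rule quadrature of func on [lower, upper]."""
    step = (upper - lower) / grid
    return sum(func(lower + (k + 0.5) * step) for k in range(grid)) * step

def relative_entropy(theta, true_mean=0.2):
    """D(t || theta) for unit-variance Gaussians with means true_mean and theta."""
    return 0.5 * (theta - true_mean) ** 2

def fisher_weighted_razor(n_samples, theta_of, sqrt_det_fisher, lower, upper):
    """Ratio of integrals weighted by sqrt(det J) in an arbitrary coordinate u.

    theta_of(u) maps the coordinate to the Gaussian mean; sqrt_det_fisher(u)
    is the square root of the Fisher Information in that coordinate.
    """
    numerator = midpoint_integral(
        lambda u: sqrt_det_fisher(u)
        * math.exp(-n_samples * relative_entropy(theta_of(u))),
        lower, upper)
    volume = midpoint_integral(sqrt_det_fisher, lower, upper)
    return numerator / volume

def razor_in_theta(n_samples):
    # In the coordinate theta the Fisher Information of the mean is 1.
    return fisher_weighted_razor(n_samples, lambda th: th, lambda th: 1.0, 0.0, 1.0)

def razor_in_phi(n_samples):
    # With theta = phi**2 the Fisher Information is 4 * phi**2, so sqrt is 2 * phi.
    return fisher_weighted_razor(
        n_samples, lambda ph: ph ** 2, lambda ph: 2.0 * ph, 0.0, 1.0)

if __name__ == "__main__":
    print(razor_in_theta(20), razor_in_phi(20))
```

Up to quadrature error the two coordinates give the same value, because the factor $\sqrt{\det J}$ absorbs the Jacobian of the change of variables.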

The choice of $\sqrt{\det J}$ as the integration measure is equivalent to a choice of Jeffreys
prior in the Bayesian interpretation of the razor. M) Jeffreys recommends this
choice of prior because, as we will note in later sections, its definition guarantees the
reparametrization invariance of the Bayesian posterior. However, the requirement of
reparametrization invariance alone does not uniquely fix the prior - any prior which
is defined to have suitable transformation properties under reparametrizations of a
model will yield the desired invariance.^ Indeed, Jeffreys considers priors related
to different distances on the space of distributions including the norms and the
relative entropy distance. I will show that choosing a Jeffreys prior is equivalent to
assuming equal prior likelihood of all distributions as opposed to equal prior likelihood
of parameters.

^It is clear that any prior that is defined as the square root of the determinant of a two form on
the parameter manifold will be invariant under reparametrizations. So reparametrization invariance
is hardly sufficient to pick out a unique prior.

From the point of view of statistical mechanics the razor can be interpreted as a
partition function with energies given by the relative entropy and temperature $1/N$.
Since temperature regulates the size of fluctuations in physical systems as does the
number of events in statistical systems, this analogy makes good sense. In Section 5 we
exploit the techniques of statistical mechanics to develop a systematic series expansion
for the razor. The geometrical interpretation will become more clear as the reader
proceeds further.

4 Geometry of Parameter Manifolds 

In this section I will derive a natural metric and measure on a parameter manifold. 
We will see that the Fisher Information is the natural metric and the natural mea- 
sure is associated with this metric. The Fisher Information has a long history as a 
local measure of distance in the space of distributions starting with the Cramer-Rao 
bounds. The work of Fisher, Rao, Amari and others has elucidated the role of geometry
in statistics (]1[)[0]) and there is a sizable literature on the construction and
interpretation of geometric quantities in information theory. However, since there 
are many potential metrics in the space of distributions the important issue is to 
determine which metric is appropriate to a given problem. Once this is done, the 
theory of Riemannian manifolds provides the necessary technology for manipulating 
parametric families in the space of distributions and the difficult task is to identify 
the geometric quantities of interest to statistics. In this section we will present two 
derivations of the natural integration measure on a parameter manifold in the context 
of density estimation. 

4.1 Distance Induced By The Relative Entropy 

It is useful to start with a somewhat heuristic derivation that recapitulates arguments 
made in the "information geometry" literature. 0) Let us start by assuming that 
in the context of model inference, the natural distance on the space of distributions 
is the relative entropy D{p\\q). Unfortunately, D does not define a metric since 
it is not symmetric and does not obey triangle inequalities except in special cases. 
However, as we shall see, D{p\\q) will induce a Riemannian metric on a parameter 
manifold given suitable technical conditions. Let $M$ be a manifold in the space of
distributions with local coordinates $\Theta = \{\theta_1, \ldots, \theta_d\}$. Let $p$ be a fixed point on $M$
and $q$ be any other point. Then the relative entropy between $p$ and $q$, $D(\Theta_p\|\Theta_q)$, is
a non-negative function of $\Theta_q$ that attains its minimum at $\Theta_q = \Theta_p$ and the value of
the minimum is zero. This means that the zeroth and first order terms in the Taylor
expansion of $D(\Theta_p\|\Theta_q)$ at $p$ vanish identically. Letting $\Delta\Theta = \Theta_q - \Theta_p$, and assuming
twice differentiability of the relative entropy in a neighbourhood of $p$, we can Taylor
expand to second order:



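$$D(\Theta_p\|\Theta_q) \approx -\frac{1}{2} \int dx\, \Pr(x|\Theta_p) \left[ \frac{1}{\Pr(x|\Theta_p)} \frac{\partial^2 \Pr(x|\Theta_p)}{\partial\theta_i\, \partial\theta_j} - \frac{1}{\Pr(x|\Theta_p)^2} \frac{\partial \Pr(x|\Theta_p)}{\partial\theta_i} \frac{\partial \Pr(x|\Theta_p)}{\partial\theta_j} \right] \Delta\theta^i \Delta\theta^j \qquad (7)$$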



In the above equation as in all future equations, repeated indices are implicitly
summed over. So, for example, there is an implicit sum on $i$ and $j$ in Equation 7.
The first term in this equation vanishes if the derivatives with respect to $\theta_i$ and $\theta_j$
commute with the integral.^ We assume this commutativity and recognize the remaining
term as one-half times the Fisher Information on the parameter manifold:
$D(\Theta_p\|\Theta_q) = (1/2) \langle \partial_{\theta_i} \ln \Pr(x|\Theta_p)\, \partial_{\theta_j} \ln \Pr(x|\Theta_p) \rangle_{\Theta_p} \Delta\theta^i \Delta\theta^j = (1/2) J_{ij} \Delta\theta^i \Delta\theta^j$.^

We have found that if we accept that the relative entropy is the natural measure
of distance between distributions in the context of model estimation, the induced
distance between nearby points on a parameter manifold is $D(p, q) = (1/2) J_{ij} \Delta\theta^i \Delta\theta^j$
to leading order in $\Delta\Theta$. Since the Fisher Information appears in this expression as a
quadratic form, it is tempting to interpret it as the natural metric on the surface. We
will only consider models where the determinant of the Fisher Information
is non-vanishing everywhere on the surface. This non-degeneracy condition essentially
guarantees that nearby points on a model manifold describe sufficiently different
distributions. Since we derived the Fisher Information metric from a Taylor expansion
at the minimum of a function we conclude that for the non-degenerate models that
are of interest to us, the Fisher Information is a positive definite metric on the model
manifold. Therefore, we can appeal to the standard theory of Riemannian geometry
to observe that the reparametrization invariant integration measure on the manifold
is $\sqrt{\det J}$ where $J$ is the Fisher Information. Putting this measure into Equation 5 for
the razor, we immediately get Equation 6 which is now coordinate-independent and
a candidate for a measure of some intrinsic properties of a parametric distribution.
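
The leading-order relation between the relative entropy and the Fisher Information can be checked numerically in a simple case. The sketch below assumes, purely for illustration, a zero-mean Gaussian family parametrized by its standard deviation $\sigma$, for which the relative entropy is $\ln(\sigma_q/\sigma_p) + \sigma_p^2/(2\sigma_q^2) - 1/2$ and the Fisher Information is $2/\sigma^2$. The function names are hypothetical.

```python
import math

def kl_gaussian_scale(sigma_p, sigma_q):
    """D(p || q) for zero-mean Gaussians with standard deviations sigma_p, sigma_q."""
    return (math.log(sigma_q / sigma_p)
            + sigma_p ** 2 / (2.0 * sigma_q ** 2) - 0.5)

def fisher_information_scale(sigma):
    """Fisher Information of a zero-mean Gaussian with respect to sigma."""
    return 2.0 / sigma ** 2

def quadratic_approximation(sigma_p, delta):
    """Leading-order estimate (1/2) * J * delta**2 of the relative entropy."""
    return 0.5 * fisher_information_scale(sigma_p) * delta ** 2

def ratio(sigma_p, delta):
    """Exact relative entropy divided by its quadratic approximation."""
    return kl_gaussian_scale(sigma_p, sigma_p + delta) / quadratic_approximation(
        sigma_p, delta)

if __name__ == "__main__":
    for delta in (0.5, 0.1, 0.001):
        print(delta, ratio(2.0, delta))
```

As the displacement shrinks, the ratio approaches one, as expected when the quadratic form built from $J$ captures the relative entropy to leading order.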



4.2 How To Count Distinguishable Models 

The Bayesian derivation of the razor of a model provides good intuitions for a more 
careful derivation of the integration measure. As we have discussed, in the Bayesian 
interpretation we would like to say that all distributions are equally likely. If this is 
the case we should give equal weight to all distinguishable distributions on a model 
manifold. However, nearby parameters index very similar distributions. So let us ask 
the question, "How do we count the number of distinct distributions in the neigh- 
bourhood of a point on a parameter manifold?" Essentially, this is a question about 
the embedding of the parameter manifold in the space of distributions. Points that 
are distinguishable as elements of the parameter space may be mapped to indistinguishable points (in
some suitable sense) of the embedding space. 

^$\int dx\, \partial_{\theta_i} \partial_{\theta_j} \Pr(x|\Theta) = \partial_{\theta_i} \partial_{\theta_j} \int dx\, \Pr(x|\Theta) = \partial_{\theta_i} \partial_{\theta_j} 1 = 0$.

^The same result can be arrived at by looking at the second order Taylor expansion around $p$ of
the symmetrized relative entropy $D(p\|q) + D(q\|p)$. In this case there is no need to make any further
assumptions about commutativity of the derivatives and integral.






To answer the question let us take p and q to be points on a parameter manifold. 
Since we are working in the context of density estimation a suitable measure of the 
distinguishability of $\Theta_p$ and $\Theta_q$ should be derived by taking $N$ data points drawn
from either p or q and asking how well we can guess which distribution produced the 
data. If p and q do not give very distinguishable distributions, they should not be 
counted separately in the razor since that would count the same distribution twice. 

Precisely this question of distinguishability is addressed in the classical theory of
hypothesis testing. Suppose $\{e_1, \ldots, e_N\} \in E^N$ are drawn iid from one of $f_1$ and $f_2$ with
$D(f_1\|f_2) < \infty$. Let $A_N \subseteq E^N$ be the acceptance region for the hypothesis that the
distribution is $f_1$ and define the error probabilities $\alpha_N = f_1^N(A_N^c)$ and $\beta_N = f_2^N(A_N)$.
($A_N^c$ is the complement of $A_N$ in $E^N$ and $f^N$ denotes the product distribution on $E^N$
describing $N$ iid outcomes drawn from $f$.) In these definitions $\alpha_N$ is the probability
that $f_1$ was mistaken for $f_2$ and $\beta_N$ is the probability of the opposite error. Stein's
Lemma tells us how low we can make $\beta_N$ given a particular value of $\alpha_N$. Indeed, let
us define:

$$\beta^\epsilon_N = \min_{A_N \subseteq E^N,\ \alpha_N < \epsilon} \beta_N \qquad (8)$$

Then Stein's Lemma tells us that:

$$\lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \ln \beta^\epsilon_N = -D(f_1\|f_2) \qquad (9)$$

By examining the proof of Stein's Lemma (|^) we find that for fixed $\epsilon$ and sufficiently
large $N$ the optimal choice of decision region places the following bound on $\beta^\epsilon_N$:

$$-D(f_1\|f_2) - \delta_N + \frac{\ln(1 - \alpha_N)}{N} \le \frac{1}{N} \ln \beta^\epsilon_N \le -D(f_1\|f_2) + \delta_N + \frac{\ln(1 - \alpha_N)}{N} \qquad (10)$$

where $\alpha_N < \epsilon$ for sufficiently large $N$. The $\delta_N$ are any sequence of positive constants
that satisfy the property that:

$$\alpha_N = f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln\frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta_N \right) < \epsilon \qquad (11)$$

for all sufficiently large $N$. The strong law of large numbers tells us that
$(1/N) \sum_{i=1}^{N} \ln(f_1(e_i)/f_2(e_i))$ converges to $D(f_1\|f_2)$ almost surely since $D(f_1\|f_2) =
E_{f_1}(\ln(f_1(e_i)/f_2(e_i)))$. Almost sure convergence implies convergence in probability so
that for any fixed $\delta$ we have:

$$f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln\frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta \right) < \epsilon \qquad (12)$$

for all sufficiently large $N$. For a fixed $\epsilon$ and a fixed $N$ let $\Delta_{\epsilon N}$ be the collection of
$\delta > 0$ which satisfy Equation 12. Let $\delta_{\epsilon N}$ be the infimum of the set $\Delta_{\epsilon N}$. Equation 12
guarantees that for any $\delta > 0$, for any sufficiently large $N$, $0 < \delta_{\epsilon N} < \delta$. We conclude
that $\delta_{\epsilon N}$ chosen in this way is a sequence that converges to zero as $N \to \infty$ while
satisfying the condition in Equation 11 which is necessary for proving Stein's Lemma.
We will now apply these facts to the problem of distinguishability of points on a
parameter manifold.
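
The exponential decay rate in Stein's Lemma can be observed numerically in a discrete toy case. The sketch below is only an illustration under stated assumptions: $f_1$ and $f_2$ are taken to be Bernoulli distributions (rather than densities with respect to Lebesgue measure), only non-randomized threshold tests on the number of successes are searched, and all function names are hypothetical.

```python
import math

def kl_bernoulli(p1, p2):
    """Relative entropy D(f1 || f2) between Bernoulli(p1) and Bernoulli(p2)."""
    return p1 * math.log(p1 / p2) + (1.0 - p1) * math.log((1.0 - p1) / (1.0 - p2))

def log_binom_pmf(n, k, p):
    """Log-probability of k successes in n Bernoulli(p) trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))

def stein_rate(n, p1, p2, eps):
    """(1/n) ln(beta) for the threshold test 'accept f1 iff successes >= k0'.

    k0 is the largest threshold whose first-kind error alpha = P_f1(k < k0)
    stays below eps; beta = P_f2(k >= k0). Assumes p1 > p2.
    """
    assert p1 > p2
    alpha = 0.0
    k0 = 0
    while k0 < n:
        mass = math.exp(log_binom_pmf(n, k0, p1))
        if alpha + mass >= eps:
            break  # raising the threshold further would violate alpha < eps
        alpha += mass
        k0 += 1
    log_beta = log_sum_exp([log_binom_pmf(n, k, p2) for k in range(k0, n + 1)])
    return log_beta / n

if __name__ == "__main__":
    print("-D(f1||f2) =", -kl_bernoulli(0.6, 0.4))
    for n in (200, 1000, 5000):
        print(n, stein_rate(n, 0.6, 0.4, 0.1))
```

For fixed $\epsilon$ the computed rate moves toward $-D(f_1\|f_2)$ as $N$ grows, although the approach is slow, consistent with correction terms that vanish only as $N \to \infty$.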

Let $\Theta_p$ and $\Theta_q$ index two distributions on a parameter manifold and suppose
that we are given $N$ outcomes generated independently from one of them. We are
interested in using Stein's Lemma to determine how distinguishable $\Theta_p$ and $\Theta_q$ are.
By Stein's Lemma:

$$-D(\Theta_p\|\Theta_q) - \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1 - \alpha_N)}{N} \le \frac{\ln \beta^\epsilon_N(\Theta_q)}{N} \le -D(\Theta_p\|\Theta_q) + \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1 - \alpha_N)}{N} \qquad (13)$$

where we have written $\delta_{\epsilon N}(\Theta_q)$ and $\beta^\epsilon_N(\Theta_q)$ to emphasize that these quantities are
functions of $\Theta_q$ for a fixed $\Theta_p$. Let $A = -D(\Theta_p\|\Theta_q) + (1/N) \ln(1 - \alpha_N)$ be the average
of the upper and lower bounds in Equation 13. Then $A \ge -D(\Theta_p\|\Theta_q) + (1/N) \ln(1 - \epsilon)$
because the $\delta_{\epsilon N}(\Theta_q)$ have been chosen to satisfy Equation 11. We now define the set
of distributions $U_N = \{\Theta_q : -D(\Theta_p\|\Theta_q) + (1/N) \ln(1 - \epsilon) \ge (1/N) \ln \beta^*\}$ where
$1 > \beta^* > 0$ is some fixed constant. Note that as $N \to \infty$, $D(\Theta_p\|\Theta_q) \to 0$ for
$\Theta_q \in U_N$. We want to show that $U_N$ is a set of distributions which cannot be
very well distinguished from $\Theta_p$. The first way to see this is to observe that the
average of the upper and lower bounds on $\ln \beta^\epsilon_N$ is greater than or equal to $\ln \beta^*$ for
$\Theta_q \in U_N$. So, in this loose, average sense, the error probability $\beta^\epsilon_N$ exceeds $\beta^*$ for
$\Theta_q \in U_N$. More carefully, note that $(1/N) \ln(1 - \alpha_N) \ge (1/N) \ln(1 - \epsilon)$ by choice of the
$\delta_{\epsilon N}(\Theta_q)$. So, using Equation 13 we see that $(1/N) \ln \beta^\epsilon_N(\Theta_q) \ge (1/N) \ln \beta^* - \delta_{\epsilon N}(\Theta_q)$.
Exponentiating this inequality we find that:

$$1 \ge [\beta^\epsilon_N(\Theta_q)]^{1/N} \ge (\beta^*)^{1/N} e^{-\delta_{\epsilon N}(\Theta_q)} \qquad (14)$$

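The significance of this expression is best understood by considering parametric families
in which, for every $\Theta_q$, $X_q(e_i) = \ln(\Theta_p(e_i)/\Theta_q(e_i))$ is a random variable with
finite mean and bounded variance, in the distribution indexed by $\Theta_p$. In that case,
taking $b$ to be the bound on the variances, Chebyshev's inequality says that:

$$\Theta_p^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} X_q(e_i) - D(\Theta_p\|\Theta_q) \right| > \delta \right) \le \frac{\mathrm{Var}(X_q)}{\delta^2 N} \le \frac{b}{\delta^2 N} \qquad (15)$$

In order to satisfy $\alpha_N < \epsilon$ it suffices to choose $\delta = (b/N\epsilon)^{1/2}$. So, if the bounded
variance condition is satisfied, $\delta_{\epsilon N}(\Theta_q) \le (b/N\epsilon)^{1/2}$ for any $\Theta_q$ and therefore we have
the limit $\lim_{N \to \infty} \sup_{\Theta_q \in U_N} \delta_{\epsilon N}(\Theta_q) = 0$. Applying this limit to Equation 14 we find
that:

$$1 \ge \lim_{N \to \infty} \inf_{\Theta_q \in U_N} [\beta^\epsilon_N(\Theta_q)]^{1/N} \ge 1 \times \lim_{N \to \infty} \inf_{\Theta_q \in U_N} e^{-\delta_{\epsilon N}(\Theta_q)} = 1 \qquad (16)$$

In summary we find that $\lim_{N \to \infty} \inf_{\Theta_q \in U_N} [\beta^\epsilon_N(\Theta_q)]^{1/N} = 1$. This is to be contrasted
with the behaviour of $\beta^\epsilon_N(\Theta_q)$ for any fixed $\Theta_q \ne \Theta_p$ for which $\lim_{N \to \infty} [\beta^\epsilon_N(\Theta_q)]^{1/N} =
\exp[-D(\Theta_p\|\Theta_q)] < 1$. We have essentially shown that the sets $U_N$ contain distributions
that are not very distinguishable from $\Theta_p$. The smallest one-sided error probability
$\beta^\epsilon_N$ for distinguishing between $\Theta_p$ and $\Theta_q \in U_N$ remains essentially constant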
leading to the asymptotics in Equation 16.

Define $\kappa = -\ln \beta^* + \ln(1 - \epsilon)$ so that we can summarize the region $U_N$ of high
probability of error $\beta^*$ at fixed $\epsilon$ as $\kappa/N \ge D(\Theta_p\|\Theta_q)$.^ As $N$ grows large for fixed
$\kappa$, the distributions $\Theta_p$ and $\Theta_q$ must be close in relative entropy sense and so we
can write $\Theta_q = \Theta_p + \Delta\Theta$ and Taylor expand the relative entropy on the manifold
near $p$. By arguments identical to those made in Section 4.1 we conclude that
$D(\Theta_p\|\Theta_q) = (1/2) J_{ij}(\Theta_p) \Delta\theta^i \Delta\theta^j + O(\Delta\Theta^3)$ where, as before, we have used the index
summation convention and defined the Fisher Information $J_{ij}$ from the matrix of
second derivatives of the relative entropy.^ So, the nearly indistinguishable region $U_N$
around $\Theta_p$ is summarized by $J_{ij}(\Theta_p) \Delta\theta^i \Delta\theta^j \le 2\kappa/N + O(\Delta\Theta^3)$, which defines the
interior of an ellipsoid on the parameter manifold. For large $N$, $\kappa/N$ is small, and so,
since the manifold is locally Euclidean, the volume of this ellipsoid is given by:

$$V_{\epsilon,\beta^*,N} = \left(\frac{2\pi\kappa}{N}\right)^{d/2} \frac{1}{\Gamma(d/2 + 1)} \frac{1}{\sqrt{\det J_{ij}(\Theta_p)}} \qquad (17)$$

We refer to $V_{\epsilon,\beta^*,N}$ as the volume of indistinguishability at levels $\epsilon$, $\beta^*$ and $N$. It measures
the volume of parameter space in which the distributions are indistinguishable
from $\Theta_p$ with error probabilities $\alpha_N < \epsilon$ and $(\beta^\epsilon_N)^{1/N} \ge (\beta^*)^{1/N} \exp(-\delta_{\epsilon N})$, given $N$
sample events.
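
The volume of indistinguishability is the volume of a $d$-dimensional ellipsoid, and a minimal sketch of its evaluation follows. It assumes the standard formula for the volume of the ellipsoid $J_{ij} \Delta\theta^i \Delta\theta^j \le 2\kappa/N$, namely the unit-ball volume $\pi^{d/2}/\Gamma(d/2 + 1)$ times $(2\kappa/N)^{d/2}$ divided by $\sqrt{\det J}$. The function and argument names are hypothetical.

```python
import math

def kappa_from_levels(beta_star, eps):
    """kappa = -ln(beta*) + ln(1 - eps); positive only when beta* < 1 - eps."""
    return -math.log(beta_star) + math.log(1.0 - eps)

def volume_of_indistinguishability(kappa, n_samples, fisher_det, dim):
    """Volume of the ellipsoid J_ij dtheta^i dtheta^j <= 2 * kappa / N.

    fisher_det is det J at the centre point and dim is the number of parameters.
    """
    radius_squared = 2.0 * kappa / n_samples
    unit_ball = math.pi ** (dim / 2.0) / math.gamma(dim / 2.0 + 1.0)
    return unit_ball * radius_squared ** (dim / 2.0) / math.sqrt(fisher_det)

if __name__ == "__main__":
    kappa = kappa_from_levels(0.5, 0.1)
    for n_samples in (10, 100, 1000):
        print(n_samples, volume_of_indistinguishability(kappa, n_samples, 1.0, 2))
```

The volume shrinks as $N^{-d/2}$ and as $1/\sqrt{\det J}$, so more data, or a region where the Fisher Information is larger, leaves fewer parameter values that cannot be told apart from the centre point.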

If $\beta^*$ is very close to one, the distributions inside $V_{\epsilon,\beta^*,N}$ are not very distinguishable
and should not be counted separately in the razor. (Equivalently, the Bayesian
prior should not treat them as separate distributions.) We wish to construct a measure
on the parameter manifold that reflects this indistinguishability. We will also
assume a principle of "translation invariance" in the space of distributions by supposing
that volumes of indistinguishability at given values of $N$, $\beta^*$ and $\epsilon$ should
have the same measure regardless of where in the space of distributions they are
centered. In what follows we will define a sequence of measures that reflect indistinguishability
and translation invariance at each level $\beta^*$, $\epsilon$ and $N$ in the space of
distributions. The continuum measure on the manifold is obtained by considering the
limits of integrals defined with respect to this sequence of measures. We begin with
the Lebesgue measure induced on the model manifold by the parameter embedding
in $R^d$. For convenience we will assume that the model manifold can be covered by a
single parameter patch so that issues of consistent sewing of patches do not arise. A
real function on the model manifold will be called a step map with respect to a finite,
Lebesgue measurable partition of the manifold if it is constant on each set in the
partition. For any Lebesgue measurable function $f$ there is a sequence of step maps
that converges pointwise to $f$ almost everywhere and in $L^1$. ([10]) We will assume that
the Fisher Information matrix $J$ is non-singular everywhere and is component-wise
Lebesgue measurable. The determinant of $J$ will consequently be everywhere finite
and also Lebesgue measurable.

^We will eventually take the limits $N \to \infty$, $\epsilon \to 0$ and $\beta^* \to 1$ in that order.
^See Section 4.1 for the assumptions concerning differentiability and commutation of derivatives and integrals.

Let $f$ be a step map with respect to some partition $A = \{A_i\}$ of the model manifold and let $J_{ij}$ be a Fisher Information matrix which is non-singular everywhere and a step map with respect to a partition $B = \{B_i\}$. Let $K = \{A_i \cap B_j : A_i \in A, B_j \in B\}$ be a partition of the manifold such that if $K_i \in K$ then both $f$ and $J$ are constant on $K_i$. Consider fixed values of $\beta^*$, $\epsilon$ and $N$ in the above definition of the volumes of indistinguishability. At fixed $\beta^*$, $\epsilon$ and $N$ we would like to define a measure $\nu_{\epsilon\beta^*N}$ by covering the sets $K_i \in K$ economically with volumes of indistinguishability and placing delta functions at the center of each volume in the cover. Such a definition would give each volume of indistinguishability equal weight in an integral over the model manifold and would ignore variations in an integrand on a scale smaller than these volumes. As such, the definition would reflect the properties of indistinguishability and translation invariance at fixed $\epsilon$, $\beta^*$ and $N$. As the volumes of indistinguishability shrink we could hope to define a continuum limit of this discrete sequence of measures. The following discussion gives a careful prescription for carrying out this agenda. The argument should be regarded as a "construction" consistent with the principles of indistinguishability and translation invariance rather than as a "derivation".

Since the program outlined above involves covering arbitrary measurable subsets of $R^d$ with volumes of indistinguishability, we begin by amassing some useful facts about covers of $R^d$ by spheres. (See [?].) Let $C_r$ be a cover of $R^d$ by spheres of radius $r$. Let $H \subset R^d$ have finite Lebesgue measure and let $N_H(C_r)$ be the number of spheres in $C_r$ that intersect $H$. Define the covering density of $H$ induced by $C_r$, $D(H, C_r)$, to be:

$$D(H, C_r) = \frac{N_H(C_r)\,V_d(r)}{\mu(H)} \qquad (18)$$

where $V_d(r)$ is the volume of a $d$ dimensional sphere of radius $r$ and $\mu(H)$ is the Lebesgue measure of $H$. Let $S_L$ be a square of side $L$ centered at any point in $R^d$. For a fixed covering radius $r$, let $\tau = r/L$ and let $N_{S_L}$ be the number of spheres of $C_r$ that intersect $S_L$. Then define the covering density of $R^d$ induced by $C_r$ to be:

$$D(R^d, C_r) = \lim_{\tau\to 0} D(S_L, C_r) = \lim_{\tau\to 0} \frac{N_{S_L}\,V_d(r)}{L^d} \qquad (19)$$

so long as this limit exists. (Usually the limit $L \to \infty$ is taken, but we will find the current formulation easier to work with.) Let a minimal cover of $R^d$ with spheres of radius $r$ be a cover that attains the minimum possible $D(R^d, C_r)$ over all covers $C_r$. This minimal density is independent of $r$ and so we will write it as $D(d)$. To show this independence, suppose that there is an $r$ dependence and that the minimal densities for $r_1$ and $r_2$ have the relationship $D(d, r_1) < D(d, r_2)$. Then by rescaling the coordinates of $R^d$ by $r_2/r_1$ we can convert the cover by spheres of radius $r_1$ into a cover by spheres of radius $r_2$. However, Equation 19 shows that the density of the cover would remain unchanged since the sides of the squares $L$ would increase in length by $r_2/r_1$. This would give a cover with radius $r_2$ whose density is less than the density $D(d, r_2)$, implying that the latter density cannot be minimal. It is well known that the minimal density for covering $R^d$ with spheres, $D(d)$, is greater than 1 (for $d \ge 2$) so that the volumes of indistinguishability used in covering parameter manifolds will necessarily intersect each other. We will pick the minimal cover using volumes of indistinguishability in order to minimize overcounting of distributions in the measure that will be derived via the construction presented in this paper.
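The definition of covering density in Equations 18 and 19 can be illustrated with a concrete cover of the plane. The sketch below is an assumption-laden example rather than part of the construction: it uses discs of radius $r$ centred on a square lattice of spacing $r\sqrt{2}$, which covers $R^2$ but is not the minimal cover, and it counts the discs that intersect a square of side $L$. As $\tau = r/L$ shrinks, the density should approach $\pi/2$, which is greater than 1 because the discs overlap.

```python
import math

def covering_density_square_lattice(r, L):
    """Density N * V_2(r) / L^2 of a square-lattice disc cover, restricted to a square of side L."""
    s = r * math.sqrt(2.0)     # lattice spacing: the cell diagonal is 2r, so the discs cover the plane
    h = L / 2.0                # the square is [-h, h]^2, centred on a lattice point
    m = int((h + r) / s) + 1   # only centres within distance r of the square can intersect it
    count = 0
    for i in range(-m, m + 1):
        dx = max(abs(i * s) - h, 0.0)      # horizontal distance from the centre to the square
        for j in range(-m, m + 1):
            dy = max(abs(j * s) - h, 0.0)  # vertical distance from the centre to the square
            if dx * dx + dy * dy <= r * r:
                count += 1                 # this disc intersects the square
    return count * math.pi * r * r / (L * L)

# tau = r / L decreases along this list, so the boundary contribution shrinks.
densities = [covering_density_square_lattice(1.0, L) for L in (25.0, 100.0, 400.0)]
```

The excess over $\pi/2$ at finite $L$ comes from discs that straddle the boundary of the square, which is the quantity controlled in the lemma that follows.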
The construction of a measure on a parameter manifold that respects indistinguishability and translation invariance requires the property that the density of the covering by $C_r$ of any disjoint union of sufficiently large squares approaches $D(d)$ when the limit in Equation 19 exists. Indeed, the following lemma is easy to show:

Lemma 4.1 Let $C_r$ be a covering of $R^d$ by spheres of radius $r$ for which $D(R^d, C_r)$ exists. Then for any $\epsilon > 0$ there is a $\tau_0 > 0$ such that if $S$ is a finite union of squares intersecting at most on their boundaries and each of whose sides exceeds $L_0$ satisfying $r/L_0 < \tau_0$, then $|D(S, C_r) - D(R^d, C_r)| < \epsilon$.

Proof: Let $S_L$ be a square of side $L$ and let $N_L$ be the number of spheres in $C_r$ that intersect $S_L$. Let $B_L$ be the number of spheres that intersect the boundary of $S_L$. Take $S_{L-2r}$ to be a square of side $L - 2r$, centered at the same location as $S_L$. Then $B_L \le N_L - N_{L-2r}$. By Equation 19, for any $\epsilon'$, we can pick $r/L$ to be small enough so that:



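$$D - \epsilon' \le \frac{N_{L-2r}\,V_d(r)}{(L-2r)^d} \le D + \epsilon' \qquad (20)$$

$$D - \epsilon' \le \frac{N_L\,V_d(r)}{L^d} \le D + \epsilon' \qquad (21)$$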



where $D = D(R^d, C_r)$. Writing $N_{L-2r} \le N_L - B_L$ and using the upper bound in Equation 21 with the lower bound in Equation 20 we find:

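$$D - \epsilon' \le \frac{D + \epsilon'}{(1 - 2r/L)^d} - \frac{(B_L/L^d)\,V_d(r)}{(1 - 2r/L)^d} \qquad (22)$$

Solving for $B_L/L^d$ we find that:

$$\frac{B_L}{L^d} \le \frac{D\left[1 - (1 - 2r/L)^d\right] + \epsilon'\left[1 + (1 - 2r/L)^d\right]}{V_d(r)} \qquad (23)$$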



This tells us that $B_L/L^d$ can be made as small as desired by picking sufficiently small $\epsilon'$ and $\tau = r/L$. Finally, let $S$ be any finite union of squares $S_i$ of sides $L_i$ where every $L_i$ exceeds some given $L_0$ and the $S_i$ intersect at most on their boundaries. Taking $N_i$ to be the number of spheres in $C_r$ intersecting $S_i$, with $B_i$ the number of spheres intersecting the boundary, we have the following bound on the density of the cover of $S$:

$$\frac{\sum_i \left[(N_i - B_i)/L_i^d\right] V_d(r)\,L_i^d}{\sum_i L_i^d} \le D(S, C_r) \le \frac{\sum_i \left[N_i/L_i^d\right] V_d(r)\,L_i^d}{\sum_i L_i^d} \qquad (24)$$

By picking $r/L_0$ to be small enough we can make $N_i V_d(r)/L_i^d$ as close as we want to $D(R^d, C_r)$ and $B_i/L_i^d$ as close as we want to zero. Consequently, since all the sums in Equation 24 are finite we can see that for any choice of $\epsilon > 0$, for sufficiently small $r/L_0$, $|D(S, C_r) - D(R^d, C_r)| < \epsilon$. This proves the lemma. □

Lemma 4.1 has given us some understanding of covers of finite unions of squares. The next lemma gives control over covers of arbitrary Lebesgue measurable subsets of $R^d$. The basic difficulty that we must confront is that there are subsets of $R^d$ of Lebesgue measure zero for which the covering density is not well defined. Since we are interested in integration on parameter manifolds it is natural that such sets of measure zero will not contribute to the integral over the manifold. The following lemma shows how to find well-behaved subsets of any Lebesgue measurable set.

Lemma 4.2 Let $D(d)$ be the minimal density for covering $R^d$ by spheres. Let $C = \{C_{r_1}, C_{r_2}, \ldots\}$ be any sequence of covers of $R^d$ such that $r_i \to 0$ as $i \to \infty$ and $D(R^d, C_{r_i}) = D(d)$ for every $i$. Take $G \subset R^d$ to have a finite Lebesgue measure. Then there exists a sequence $H_k \subseteq G$ such that a) $\lim_{k\to\infty} \mu(G - H_k) = 0$ and b) $\lim_{i\to\infty} D(H_k, C_{r_i}) = D(d)$.

Proof: Let $G \subset R^d$ have finite Lebesgue measure. Let $H = \overline{G^\circ}$ be the closure of the interior of $G$, which differs from $G$ at most by a set of measure zero. Then $H$ can be written as a countable union of squares $S_i$ each of which has finite measure and which intersect at most on their boundaries. Let $R_k = \{S_i : \mu(S_i) > 1/k^d\}$ be the set of these squares that have side greater than $1/k$. It is clear that $H_k = \bigcup_{S_i \in R_k} S_i \subseteq H$ and that $\lim_{k\to\infty} \mu(H - H_k) = 0$. This proves the first part of the lemma. Each $H_k$ is a finite union of squares of side greater than $1/k$ that intersect at most on their boundaries. So, by Lemma 4.1, for any $\epsilon > 0$ there is a $\tau_0$ such that if $r_i k < \tau_0$, then $|D(H_k, C_{r_i}) - D(d)| < \epsilon$. Since $r_i \to 0$ in the limit $i \to \infty$, $D(H_k, C_{r_i}) \to D(d)$. This proves the second part of the lemma. □

We have found that in the limit that the radius of covering spheres $r$ goes to zero, any subset of $R^d$ of finite Lebesgue measure can be covered up to a set of measure zero with a minimal thickness $D(d)$. We will now use this lemma to construct a measure on a parameter manifold that reflects indistinguishability and translation invariance. Define a regular sequence $\{H_k\}$ of a Lebesgue measurable set $G$ to be one of the sequences $\{H_k\}$ whose existence was shown in Lemma 4.2. Now consider one of the sets $K_p$ in which the function $f$ and the Fisher Information $J_{ij}$ are constant. By rescaling the coordinates of $K_p$ by $J_{ij}$ we transform the volumes of indistinguishability into spheres of volume:

$$V_{\epsilon,\beta^*,N} = \frac{(2\pi\kappa/N)^{d/2}}{\Gamma(d/2 + 1)} \qquad (25)$$

and change the measure of $K_p$ from $\mu(K_p)$ to $\sqrt{\det J}\,\mu(K_p)$ where $\mu$ is the Lebesgue measure in the original coordinates. Now suppose that we want to integrate the step function $f$ over the measurable domain $I$. Let $K_{pI} = K_p \cap I$ and let $\{H_{pIk}\}$ be a regular sequence of $K_{pI}$. The transformed coordinates define an embedding of $K_{pI}$ into $R^d$ and we consider a minimal covering of $R^d$ by transformed volumes of indistinguishability $V_{\epsilon,\beta^*,N}$. This minimal covering induces a cover of $K_{pI}$ and therefore of each $H_{pIk}$. We define a measure $\nu_{\epsilon\beta^*Nk}$ at levels $\epsilon$, $\beta^*$, $N$ and $k$ for integration over $I$ by placing a delta function at some point in the intersection of each covering sphere and $H_{pIk}$. This yields the following definition of integration of the step function $f$:

Definition 1 Let $\{K_p\}$ be the sets on which the step maps $f$ and $J_{ij}$ are both constant. Then, at levels of indistinguishability $\epsilon$, $\beta^*$ and $N$, and at level $k$ in a regular sequence of each $K_p$, we define the integral of $f$ over the measurable domain $I$ to be:

$$\int_I f\,d\nu_{\epsilon\beta^*Nk} = \sum_p f_p\,N_{pIk} \qquad (26)$$

where $N_{pIk}$ is the number of spheres that intersect $H_{pIk} \subset K_p \subset R^d$ in the cover of $R^d$ by the spheres $V_{\epsilon,\beta^*,N}$.

We are actually interested in a measure $\gamma_{\epsilon\beta^*Nk}$ normalized so that the integral of 1 over the entire manifold gives unity. The normalization is easily achieved by dividing Equation 26 by the integral of 1 over the manifold $\mathcal{M}$.

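Definition 2 Let $\{H_{pk}\}$ be a regular sequence of $K_p$ and let $N_{pk}$ be the number of spheres that intersect $H_{pk} \subset K_p \subset R^d$ in the cover of $R^d$ by the spheres $V_{\epsilon,\beta^*,N}$. The normalized integral of the step function $f$ over the domain $I$ is given by:

$$\int_I f\,d\gamma_{\epsilon\beta^*Nk} = \frac{\sum_p f_p\,N_{pIk}/N^{d/2}}{\sum_p N_{pk}/N^{d/2}} \qquad (27)$$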

The division by $N^{d/2}$ is motivated by our desire to take the limit $N \to \infty$. The definition in Equation 27 reflects the properties of indistinguishability and translation invariance by ignoring variations on a scale smaller than the volumes $V_{\epsilon,\beta^*,N}$ and giving equal weight in the integral to all such volumes.

We begin by taking the limit $N \to \infty$ so that the definition of the integral reflects indistinguishability in the limit of an infinite amount of data. This is followed by the limit $k \to \infty$ so that the entire domains $K_{pI}$ are included in the integral up to a set of measure zero. Then we will take the limits $\epsilon \to 0$ and $\beta^* \to 1$ so that we are working with truly indistinguishable distributions. Finally we will take $L^1$-completions of $f$ and $J_{ij}$ to arrive at the definition of integration of any Lebesgue measurable $f$ over a parameter manifold with Lebesgue measurable and non-singular Fisher Information. The result of this sequence of limits is summarized in the following theorem.

Theorem 4.1 Let $\mu$ be the Lebesgue measure on a parameter manifold $\mathcal{M}$ that is induced by the parametrization. Let $f$ be any Lebesgue measurable function on the manifold and let the Fisher Information $J_{ij}$ be Lebesgue measurable and non-singular everywhere on the manifold. Let $\gamma$ be the normalized measure on $\mathcal{M}$ that measures the volume of distinguishable distributions indexed by the parameters. Then $\gamma$ is absolutely continuous with respect to $\mu$ and if $I$ is any Lebesgue measurable set, then:

$$\int_I f\,d\gamma = \frac{\int_I f\,\sqrt{\det J_{ij}}\,d\mu}{\int \sqrt{\det J_{ij}}\,d\mu} \qquad (28)$$



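Proof: First of all, observe that the volumes $V_{\epsilon,\beta^*,N}$ used in covering the $H_{pIk}$ have radius $r = \sqrt{2\kappa/N}$. Consequently, the sequence of covers by $V_{\epsilon,\beta^*,N}$ for increasing $N$ and the sets of the regular sequence $\{H_{pIk}\}$ satisfy the conditions of Lemma 4.2. Therefore, applying the lemma and the definition of the density of a cover (Equation 18), we find:

$$\lim_{k\to\infty}\lim_{N\to\infty} \frac{N_{pIk}}{N^{d/2}} = \lim_{k\to\infty} D(d)\,\mu(H_{pIk})\,\sqrt{\det J_p}\,\frac{\Gamma(d/2+1)}{(2\pi\kappa)^{d/2}} = D(d)\,\mu(K_{pI})\,\sqrt{\det J_p}\,\frac{\Gamma(d/2+1)}{(2\pi\kappa)^{d/2}} \qquad (29)$$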
where $J_p$ is the Fisher Information in the region $K_p$. (We have used the fact that the measure of $K_{pI}$ in the coordinates in which the volumes of indistinguishability are spheres is $\mu(K_{pI})\sqrt{\det J_p}$.) Therefore, both the numerator and denominator of the right hand side of Equation 27 are finite sums of terms that approach finite limits as $N \to \infty$. So we can evaluate the limits of these terms to write:



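$$\lim_{k\to\infty}\lim_{N\to\infty}\int_I f\,d\gamma_{\epsilon\beta^*Nk} = \frac{\sum_p f_p\,D(d)\,\mu(K_{pI})\,\sqrt{\det J_p}\;\Gamma(d/2+1)/(2\pi\kappa)^{d/2}}{\sum_p D(d)\,\mu(K_p)\,\sqrt{\det J_p}\;\Gamma(d/2+1)/(2\pi\kappa)^{d/2}} = \frac{\sum_p f_p\,\sqrt{\det J_p}\,\mu(K_{pI})}{\sum_p \sqrt{\det J_p}\,\mu(K_p)} \qquad (30)$$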

The right hand side is now independent of $\beta^*$ and $\epsilon$, permitting us to freely take the limits $\beta^* \to 1$ and $\epsilon \to 0$, which gives us the definition of a normalized integral over truly indistinguishable distributions which we write as $\int_I f\,d\gamma$.

We now want to take the $L^1$-completion of the step maps $J_{ij}$ and $f$ in order to arrive at the definition of integration of any function that is Lebesgue measurable on the manifold. First we take the completion of the $J_{ij}$ with respect to Lebesgue measure. By the standard theory of integration, the sums $\sum_p \sqrt{\det J_p}\,\mu(K_{pI})$ converge to integrals to give the following definition of the integrals of step maps $f$:



$$\int_I f\,d\gamma = \frac{\sum_i f_i \int_{A_i \cap I} \sqrt{\det J}\,d\mu}{\int \sqrt{\det J}\,d\mu} \qquad (31)$$

where the step function $f$ is constant on the sets $A_i$. We have arrived at a new measure on the manifold, $\gamma(K) = \left(\int_K \sqrt{\det J}\,d\mu\right)/\int \sqrt{\det J}\,d\mu$, where $\mu$ is the original Lebesgue measure. Since $\gamma$ and $\mu$ are absolutely continuous with respect to each other, the $L^1$-completion of step maps with respect to $\gamma$ describes the same class of functions as the completion of step maps with respect to $\mu$. We can therefore take the completion of $f$ with respect to $\gamma$ to arrive at the following definition of integration of any Lebesgue measurable function on a parameter manifold:

$$\int_I f\,d\gamma = \frac{\int_I f\,\sqrt{\det J}\,d\mu}{\int \sqrt{\det J}\,d\mu} \qquad (32)$$

where $\mu$ is the Lebesgue measure induced by the parametrization. As discussed above, this construction accounts for indistinguishability and translation invariance in the space of probability distributions. □
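The counting definition of the integral can be made concrete in one dimension, where the minimal cover by intervals has density 1. The sketch below is a hypothetical illustration, not the construction itself: each piece of a step map is described by an assumed triple (value of $f$, value of $J$, Lebesgue length), the number of covering intervals on each piece is approximated by the number needed to cover its rescaled length, and the ratio of counts is compared with the limiting expression $\sum_p f_p \sqrt{J_p}\,\mu(K_p) / \sum_p \sqrt{J_p}\,\mu(K_p)$.

```python
import math

def normalized_step_integral(pieces, kappa, N):
    """Ratio sum_p f_p N_p / sum_p N_p for a 1D step map; pieces are (f_p, J_p, length_p)."""
    r = math.sqrt(2.0 * kappa / N)   # radius of a volume of indistinguishability after rescaling
    # Approximate the number of covering intervals of length 2r on each rescaled piece.
    counts = [math.ceil(math.sqrt(J) * length / (2.0 * r)) for _, J, length in pieces]
    numerator = sum(f * n for (f, _, _), n in zip(pieces, counts))
    return numerator / sum(counts)   # the common factor N^(d/2) cancels in the ratio

def limiting_integral(pieces):
    """Continuum value: sum_p f_p sqrt(J_p) mu(K_p) / sum_p sqrt(J_p) mu(K_p)."""
    weights = [math.sqrt(J) * length for _, J, length in pieces]
    return sum(f * w for (f, _, _), w in zip(pieces, weights)) / sum(weights)

# f is the indicator of the first half of [0, 1]; J jumps from 1 to 9 at the midpoint.
pieces = [(1.0, 1.0, 0.5), (0.0, 9.0, 0.5)]
discrete = normalized_step_integral(pieces, kappa=1.0, N=10**8)
limit = limiting_integral(pieces)
```

In this example the region with larger Fisher Information holds three times as many distinguishable distributions per unit parameter length, so the first half of the interval receives weight 1/4 rather than 1/2.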

In sum, the normalized measure on the manifold that accounts for the indistinguishability of neighbouring distributions is given by:

$$d\gamma = \frac{d\mu\,\sqrt{\det J_{ij}}}{\int d\mu\,\sqrt{\det J_{ij}}} \qquad (33)$$

where $d\mu$ is the Lebesgue measure on the manifold induced by its atlas, which in simple cases is simply the product measure $d^d\Theta = \prod_{i=1}^{d} d\theta_i$. In this expression we have taken the limits $\beta^* \to 1$ and $N \to \infty$. The meaning of this is that we are dividing out the volume of the parameter space which contains models that will be perfectly indistinguishable (in a one-sided error) given an arbitrary amount of data. As discussed earlier, Equation 33 is equivalent to a choice of Jeffreys prior in the Bayesian formulation of model inference. We stated earlier that Jeffreys prior has the desirable property of being reparametrization invariant on account of the transformation properties of the Fisher Information that enters its definition, but that one could define many such quantities.^ It appears that the derivation in this paper may provide the first rigorous justification for a choice of Jeffreys prior for Bayesian inference that does not involve assumption of a Minimum Description Length principle or a statement concerning compact coding of data. The derivation suggests that the Jeffreys prior, built from the square root of the determinant of the Fisher Information, is the reparametrization invariant prior on the parameter manifold that is induced by a uniform prior in the space of distributions. As such it would seem to be the natural prior for density estimation in a Bayesian context. It is worthwhile to point out that the work of Wallace and Freeman and Barron and Cover (among others) has demonstrated that the optimal code derived from a parametric model should pick parameters from a grid distributed with a density proportional to the square root of the determinant of the Fisher Information matrix. ([?], [?]) The continuum limit of these grids can be obtained in the fashion demonstrated here and would yield a Jeffreys prior on the parameter manifold.
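As a concrete instance of Equation 33, consider the Bernoulli family, used here purely as an assumed example. Its Fisher Information is $1/(\theta(1-\theta))$, so the normalizing integral equals $\pi$ and the measure of an interval $[a, b]$ has the closed form $(2/\pi)(\arcsin\sqrt{b} - \arcsin\sqrt{a})$. The sketch below compares a midpoint-rule evaluation of Equation 33 with that closed form; the function names are hypothetical.

```python
import math

def sqrt_det_fisher_bernoulli(theta):
    """sqrt(det J) for the Bernoulli family, where J = 1 / (theta (1 - theta))."""
    return 1.0 / math.sqrt(theta * (1.0 - theta))

def midpoint_integral(func, a, b, n=200_000):
    """Midpoint rule; it never evaluates func at the (singular) endpoints."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def jeffreys_measure(a, b):
    """Equation 33 integrated over [a, b], normalized over the whole manifold (0, 1)."""
    total = midpoint_integral(sqrt_det_fisher_bernoulli, 0.0, 1.0)
    return midpoint_integral(sqrt_det_fisher_bernoulli, a, b) / total

def jeffreys_measure_exact(a, b):
    """Closed form for the Bernoulli family."""
    return (2.0 / math.pi) * (math.asin(math.sqrt(b)) - math.asin(math.sqrt(a)))

numeric = jeffreys_measure(0.0, 0.1)
exact = jeffreys_measure_exact(0.0, 0.1)
```

The interval $[0, 0.1]$ receives roughly twice the weight that a uniform measure on $\theta$ would give it, reflecting the larger number of distinguishable distributions near the boundary of the parameter space.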

The reader may worry that the asymmetric errors $\alpha \to 0$ and $\beta \to 1$ are a little peculiar since they imply that $p$ can be distinguished from $q$, but $q$ cannot be distinguished from $p$. A more symmetric analysis can be carried out in terms of the Chernoff bound at the expense of a convexity assumption on the parameter manifold. Since the derivation of the measure using the Chernoff bound exactly parallels the derivation using Stein's Lemma and yields the same result, we will not present it here. Although the derivation in this section has focussed on deriving the measure on a parameter manifold, future sections will take the metric on the manifold to be the Fisher Information.



4.3 Riemannian Geometry in The Space of Distributions 

I do not have enough space in this paper to recapitulate the theory of Riemannian geometry in the setting of the space of probability distributions. I will therefore assume that the reader has a rudimentary understanding of the notions of vectors, connection coefficients and covariant derivatives on manifolds. The necessary background can be gleaned from the early pages of any differential geometry or general relativity textbook. A discussion of geometry in a specifically statistical setting can be found in the works of Amari, Rao and others. ([1], [?]) In the next section I will assume a rudimentary knowledge of geometry, but since we do not need any sophisticated results, the reader who is unfamiliar with covariant derivatives should still be able to understand most of the results. Table 1 provides a few formulae that will be useful to such readers, but contains no explanations.

^Indeed Jeffreys considers priors related to various distances such as the $L^p$ norms. ([?])
Vectors are objects in the tangent space of a manifold. We write them with upper indices as $V^\mu$. One-forms are duals to vectors. We write them with lower indices as $W_\mu$. Tensors are formed by taking tensor products of vectors and forms. The metric is a rank 2 tensor with two lower indices and is written as $g_{\mu\nu}$. The inverse metric is written with upper indices as $g^{\mu\nu}$. In this paper the metric on a parameter manifold is found to be the Fisher Information $J_{\mu\nu}$. We map between vectors and one-forms (between upper and lower indices) using the metric or its inverse: e.g., $V_\mu = J_{\mu\nu}V^\nu$ and $W^\mu = J^{\mu\nu}W_\nu$, where we use the summation convention that repeated indices are summed over. We define the covariant derivative $D_\mu$ on a manifold which acts as follows on functions ($f$), vectors ($V^\nu$) and one-forms ($W_\nu$):

$$D_\mu f = \partial_\mu f, \qquad D_\mu V^\nu = \partial_\mu V^\nu + \Gamma^\nu_{\mu\rho}V^\rho, \qquad D_\mu W_\nu = \partial_\mu W_\nu - \Gamma^\rho_{\mu\nu}W_\rho$$

In these equations $\partial_\mu$ is the usual partial derivative with respect to the coordinate $\theta^\mu$. Derivatives of higher tensors are defined analogously. The $\Gamma^\rho_{\mu\nu}$ are the unique metric-compatible connection coefficients defined as follows:

$$\Gamma^\rho_{\mu\nu} = \frac{1}{2}\,g^{\rho\sigma}\left(\partial_\mu g_{\sigma\nu} + \partial_\nu g_{\sigma\mu} - \partial_\sigma g_{\mu\nu}\right) \qquad (34)$$

The covariant derivative of the metric vanishes using this connection and this elementary fact is used in this paper. The curvature of the manifold is measured by the failure of the covariant derivative to commute. The characterization of curvature in a statistical setting can be found in the work of Amari, Rao and others. ([1], [?])

Table 1: Useful Geometric Equations
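For readers who prefer to see the connection coefficients of Table 1 computed, the following sketch evaluates the standard metric-compatible (Christoffel) formula by central finite differences. The example metric is an assumption chosen for illustration: the Fisher Information of the Gaussian family in coordinates $(\mu, \sigma)$, which is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$. The helper names are hypothetical and the matrix inverse is written for two dimensions only.

```python
def gaussian_fisher_metric(theta):
    """Fisher Information of N(mu, sigma^2) in coordinates theta = (mu, sigma)."""
    _, sigma = theta
    return [[1.0 / sigma ** 2, 0.0], [0.0, 2.0 / sigma ** 2]]

def inverse_2x2(g):
    """Inverse of a 2x2 matrix (sufficient for this two-parameter example)."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]

def christoffel(metric, theta, h=1e-5):
    """gamma[r][m][n] = Gamma^r_{mn} = (1/2) g^{rs} (d_m g_{sn} + d_n g_{sm} - d_s g_{mn})."""
    d = len(theta)
    dg = []  # dg[s][m][n] approximates the partial derivative d_s g_{mn}
    for s in range(d):
        up, down = list(theta), list(theta)
        up[s] += h
        down[s] -= h
        g_up, g_down = metric(up), metric(down)
        dg.append([[(g_up[m][n] - g_down[m][n]) / (2.0 * h) for n in range(d)]
                   for m in range(d)])
    g_inv = inverse_2x2(metric(theta))
    return [[[0.5 * sum(g_inv[r][s] * (dg[m][s][n] + dg[n][s][m] - dg[s][m][n])
                        for s in range(d))
              for n in range(d)] for m in range(d)] for r in range(d)]

gamma = christoffel(gaussian_fisher_metric, [0.0, 2.0])
```

For this metric the non-vanishing coefficients have the closed forms $\Gamma^\mu_{\mu\sigma} = -1/\sigma$, $\Gamma^\sigma_{\mu\mu} = 1/(2\sigma)$ and $\Gamma^\sigma_{\sigma\sigma} = -1/\sigma$, which the numerical values at $\sigma = 2$ should reproduce.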
5 Parsimony and Consistency of The Razor



In the previous sections we have constructed the razor from Bayes' Rule and discussed measures and metrics on parameter manifolds. We are left with a candidate for a coordinate invariant index of simplicity and accuracy of a parametric family $A$ as a model of a true distribution $t$:

$$R_N(A) = \frac{\int d\mu(\Theta)\,\sqrt{\det J_{ij}}\;\exp\left[-N D(t\|\Theta)\right]}{\int d\mu(\Theta)\,\sqrt{\det J_{ij}}} \qquad (35)$$

where the Fisher Information $J_{ij}$ is the metric on the manifold. In this section I will demonstrate that the razor has the desired properties of measuring simplicity and accuracy. In order to make progress various technical assumptions are necessary. Let $\Theta^*$ be the value of $\Theta$ that globally minimizes $D(t\|\Theta)$. I will assume that $\Theta^*$ is a unique global minimum and that it lies in the interior of the compact parameter manifold. I will also assume that $D(t\|\Theta)$ and $J_{ij}(\Theta)$ are smooth functions of $\Theta$ in order that Taylor expansions of these quantities are possible. (Actually the degree of continuity required here depends on the accuracy of the approximation we seek and since we will only evaluate terms to $O(1/N)$ we will only require the existence of derivatives up to the fourth order for our computations.) Finally, let the values of the local minima be bounded away from the global minimum by some $b$. For any given $b$, for sufficiently large $N$, the value of the razor will be dominated by the neighbourhood of $\Theta^*$. Our strategy for evaluating the razor will be to Taylor expand the exponent in the integrand around $\Theta^*$ and to develop a perturbation expansion in inverse powers of $N$. We will omit mention of the $O(\exp(-bN))$ terms arising from the local minima.

In their analysis of the asymptotics of the Bayesian marginal density Clarke and Barron introduce a notion of "soundness of parametrization".^ This condition is intended to guarantee that there is a one-to-one map between parameters and distributions and that distant parameters index distant distributions. In geometric terms this simply means that the parameter manifold is embedded in the space of distributions in such a way that no two separable points on the manifold are embedded inseparably in the space of distributions - i.e., the manifold does not fold back on itself or intersect itself. The conditions stated for the following analysis are much weaker because we only need "soundness" at the point on the manifold that is closest to the true distribution in relative entropy. Even this is merely a technical condition for ease of analysis - multiple global maxima of the integrand of the razor would simply contribute separately to the analysis and thereby increase the value of the razor. The most important conditions required in this paper are that Taylor expansions of the relevant quantities should exist at $\Theta^*$.

^A parametric family is said to be "sound" if convergence of a sequence of parameter values is equivalent to weak convergence of the distributions indexed by the parameters.
5.1 A Perturbative Expansion of The Razor

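We can begin the evaluation of the razor by rewriting it as:

$$R_N(A) = \frac{\int d\mu(\Theta)\,\exp\left[-N D(t\|\Theta) + \frac{1}{2}\mathrm{Tr}\ln J_{ij}(\Theta)\right]}{\int d\mu(\Theta)\,\sqrt{\det J_{ij}}} \qquad (36)$$

where Tr denotes trace. Define $F(\Theta) = \mathrm{Tr}\ln J_{ij}$. Let $\tilde{J}_{\mu_1\cdots\mu_n} = \nabla_{\mu_1}\cdots\nabla_{\mu_n} D(t\|\Theta)|_{\Theta^*}$ be the $n$th covariant derivative of the relative entropy with respect to $\theta^{\mu_1}\cdots\theta^{\mu_n}$ evaluated at $\Theta^*$. Define the $n$th covariant derivatives of $F(\Theta)$, written $F_{\mu_1\cdots\mu_n}$, similarly. We can Taylor expand the exponent in Equation 36 in terms of these quantities. Letting $E$ be the exponent, we find that: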



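$$E = -N\left[D(t\|\Theta^*) + \sum_{i=2}^{\infty}\frac{1}{i!}\,\tilde{J}_{\mu_1\cdots\mu_i}\,\delta\Theta^{\mu_1}\cdots\delta\Theta^{\mu_i}\right] + \frac{1}{2}\sum_{i=0}^{\infty}\frac{1}{i!}\,F_{\mu_1\cdots\mu_i}\,\delta\Theta^{\mu_1}\cdots\delta\Theta^{\mu_i}, \qquad \delta\Theta = \Theta - \Theta^* \qquad (37)$$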

To proceed further, shift the integration variable to $\Theta - \Theta^* = \delta\Theta$ and rescale to integrate with respect to $\Phi = \sqrt{N}\,\delta\Theta$. With this change of variables the razor becomes:



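$$R_N(A) = \frac{e^{-\left(N D(t\|\Theta^*) - \frac{1}{2}F(\Theta^*)\right)}\,N^{-d/2}\int d\mu(\Phi)\,\exp\left[-\frac{1}{2}\tilde{J}_{\mu\nu}\phi^\mu\phi^\nu - G(\Phi)\right]}{\int d\mu(\Theta)\,\sqrt{\det J_{ij}}} \qquad (38)$$

where $G(\Phi)$ collects the terms in the exponent that are suppressed by powers of $N$: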



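$$\begin{aligned} G(\Phi) &= \sum_{i=1}^{\infty}\frac{1}{N^{i/2}}\left[\frac{1}{(i+2)!}\,\tilde{J}_{\mu_1\cdots\mu_{i+2}}\,\phi^{\mu_1}\cdots\phi^{\mu_{i+2}} - \frac{1}{2\,i!}\,F_{\mu_1\cdots\mu_i}\,\phi^{\mu_1}\cdots\phi^{\mu_i}\right] \\ &= \frac{1}{\sqrt{N}}\left[\frac{1}{3!}\,\tilde{J}_{\mu_1\mu_2\mu_3}\,\phi^{\mu_1}\phi^{\mu_2}\phi^{\mu_3} - \frac{1}{2}\,F_{\mu_1}\phi^{\mu_1}\right] + \frac{1}{N}\left[\frac{1}{4!}\,\tilde{J}_{\mu_1\cdots\mu_4}\,\phi^{\mu_1}\cdots\phi^{\mu_4} - \frac{1}{2\cdot 2!}\,F_{\mu_1\mu_2}\,\phi^{\mu_1}\phi^{\mu_2}\right] + O\!\left(\frac{1}{N^{3/2}}\right) \end{aligned} \qquad (39)$$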



Note that the leading term in $G(\Phi)$ is $O(1/\sqrt{N})$. The razor may now be evaluated in a series expansion using a standard trick from statistical mechanics. Define a "source" $h = \{h_1 \ldots h_d\}$ as an auxiliary variable. Then it is easy to verify that the razor can be written as:



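$$R_N(A) = \frac{e^{-\left(N D(t\|\Theta^*) - \frac{1}{2}F(\Theta^*)\right)}}{N^{d/2}\int d\mu(\Theta)\,\sqrt{\det J_{ij}}}\; e^{-G(\nabla_h)}\int d\mu(\Phi)\,\exp\left[-\frac{1}{2}\tilde{J}_{\mu\nu}\phi^\mu\phi^\nu + h_\mu\phi^\mu\right]\Bigg|_{h=0} \qquad (40)$$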



where the derivatives have been assumed to commute with the integral. The function $G(\Phi)$ has been removed from the integral and its argument ($\Phi = (\phi^1 \ldots \phi^d)$) has been replaced by $\nabla_h = \{\partial_{h_1} \cdots \partial_{h_d}\}$. Evaluating the derivatives and setting $h = 0$ reproduces the original expression for the razor. But now the integral is a simple Gaussian. The only further obstruction to doing the integral is that the parameter space is compact and consequently the integral is a complicated multi-dimensional error function. As our final simplifying assumption we will analyze a situation where $\Theta^*$ is sufficiently in the interior, or $N$ is sufficiently large, as to give negligible error when the integration bounds are extended to infinity. The integral can now be done instantly. We find that:



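$$R_N(A) = \frac{e^{-\left(N D(t\|\Theta^*) - \frac{1}{2}F(\Theta^*)\right)}}{N^{d/2}\int d\mu(\Theta)\,\sqrt{\det J_{ij}}}\; e^{-G(\nabla_h)}\left[\left(\frac{(2\pi)^d}{\det\tilde{J}_{\mu\nu}}\right)^{1/2}\exp\left(\frac{1}{2}(\tilde{J}^{-1})^{\mu\nu}h_\mu h_\nu\right)\right]_{h=0} \qquad (41)$$

Expanding $\exp(-G)$ and collecting terms gives: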



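$$R_N(A) = \frac{e^{-N D(t\|\Theta^*)}}{V}\left(\frac{2\pi}{N}\right)^{d/2}\left(\frac{\det J_{ij}(\Theta^*)}{\det\tilde{J}_{\mu\nu}}\right)^{1/2}\left[1 + O\!\left(\frac{1}{N}\right)\right] \qquad (42)$$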



where we have defined $V = \int d\mu(\Theta)\,\sqrt{\det J}$ to be the volume of the parameter manifold measured in the Fisher Information metric.^ The terms of order $1/N$ arise from the action of $G$ in Equation 41. It turns out to be most useful to examine $\chi_N(A) = -\ln R_N(A)$. In that case, we can write, to order $1/N$:



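$$\begin{aligned} \chi_N(A) ={}& N D(t\|\Theta^*) + \frac{d}{2}\ln N - \frac{1}{2}\ln\left(\frac{\det J_{ij}(\Theta^*)}{\det\tilde{J}_{\mu\nu}}\right) - \ln\left[\frac{(2\pi)^{d/2}}{V}\right] \\ &+ \frac{1}{N}\Bigg\{\frac{\tilde{J}_{\mu_1\mu_2\mu_3\mu_4}}{4!}\left[(\tilde{J}^{-1})^{\mu_1\mu_2}(\tilde{J}^{-1})^{\mu_3\mu_4} + \ldots\right] - \frac{F_{\mu_1\mu_2}}{2\cdot 2!}\,(\tilde{J}^{-1})^{\mu_1\mu_2} \\ &\quad - \frac{\tilde{J}_{\mu_1\mu_2\mu_3}\tilde{J}_{\nu_1\nu_2\nu_3}}{2!\,3!\,3!}\left[(\tilde{J}^{-1})^{\mu_1\mu_2}(\tilde{J}^{-1})^{\mu_3\nu_1}(\tilde{J}^{-1})^{\nu_2\nu_3} + \ldots\right] - \frac{F_{\mu_1}F_{\mu_2}}{2!\cdot 4}\,(\tilde{J}^{-1})^{\mu_1\mu_2} \\ &\quad + \frac{F_{\mu_1}\tilde{J}_{\mu_2\mu_3\mu_4}}{2!\,3!}\left[(\tilde{J}^{-1})^{\mu_1\mu_2}(\tilde{J}^{-1})^{\mu_3\mu_4} + \ldots\right]\Bigg\} \end{aligned} \qquad (43)$$

The ellipses within the brackets indicate the further terms obtained from all other distinct pairings of the indices on the single term that has been indicated, and we have omitted terms of $O(1/N^2)$ and smaller. It is worthwhile to point out that the systematic series expansion above allows us to evaluate the razor to arbitrary accuracy for any relative entropy functions whose Taylor expansion exists and whose derivatives grow sufficiently slowly with order. Therefore, this method of analyzing the asymptotics circumvents the need to place bounds on the higher terms since they can be explicitly evaluated. The statistical mechanical idea of using such expansions around a saddlepoint of an integral could also find applications in other asymptotic analyses in information theory in which integrals are dominated by narrow maxima.^ In the next section we will discuss why this large $N$ analysis shows that the razor measures simplicity and accuracy and we will analyze the geometric meaning of the terms in the above expansion. We will then discuss the connections between Equation 43 and the Minimum Description Length Principle.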
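The leading term of the expansion can be checked against a direct numerical evaluation of the razor in a simple case. The sketch below is an illustration under stated assumptions rather than part of the derivation: it takes the Bernoulli family with the true distribution $t$ lying inside the family, so that $D(t\|\Theta^*) = 0$, the two determinants in Equation 42 coincide, $d = 1$ and $V = \pi$. The razor of Equation 35 is computed by midpoint quadrature and divided by the leading term $\sqrt{2\pi/N}/\pi$; the function names are hypothetical.

```python
import math

def kl_bernoulli(t, theta):
    """Relative entropy D(t||theta) between Bernoulli(t) and Bernoulli(theta)."""
    return t * math.log(t / theta) + (1.0 - t) * math.log((1.0 - t) / (1.0 - theta))

def razor_numeric(t, N, n=200_000):
    """Equation 35 for the Bernoulli family, by midpoint quadrature on (0, 1)."""
    h = 1.0 / n
    numerator = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        # sqrt(det J) = 1 / sqrt(theta (1 - theta)); the exponential underflows harmlessly to 0.
        numerator += math.exp(-N * kl_bernoulli(t, theta)) / math.sqrt(theta * (1.0 - theta))
    return numerator * h / math.pi   # the Fisher volume V of the Bernoulli family is pi

def razor_asymptotic(N):
    """Leading term of Equation 42 for d = 1, D(t||Theta*) = 0, equal determinants, V = pi."""
    return math.sqrt(2.0 * math.pi / N) / math.pi

ratios = {N: razor_numeric(0.3, N) / razor_asymptotic(N) for N in (100, 1000)}
```

The ratios should approach 1 as $N$ grows, with a deviation that shrinks roughly in proportion to $1/N$, in line with the $1 + O(1/N)$ factor in Equation 42.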



^The terms of order $1/\sqrt{N}$ integrate to zero because they are odd in $\phi$ while the Gaussian integrand and the integration domain in our approximation are even in $\phi$.
^A simple form of this method of integration has appeared before under the rubric "Laplace's Method" in the work of Barron and others. ([?])

5.2 Parsimony and Consistency

The various terms of Equation 43 tell us why the razor is a measure of the simplicity and accuracy of a parametric distribution. Models with higher values of $R_N(A)$ and therefore lower values of $\chi_N(A)$ are considered to be better. The $O(N)$ term, $N D(t\|\Theta^*)$, measures the relative entropy between the true distribution and the best model distribution on the manifold. This is a measure of the accuracy with which the model family $A$ will be able to describe $t$. The geometric interpretation of this term is that it arises from the distance between the true distribution and the closest point on the model manifold in relative entropy sense. The $O(\ln N)$ term, $(d/2)\ln N$, tells us that the value of the log razor increases linearly in the dimension of the parameter space. This penalizes models with many degrees of freedom. The geometric reason for the existence of this term is that the volume of a peak in the integrand of the razor measured relative to the volume of the manifold shrinks more rapidly as a function of $N$ in higher dimensions. The $O(1)$ term is even more interesting. The square root of the determinant of $\tilde{J}^{-1}_{\mu\nu}$ is proportional to the volume of the ellipsoid in parameter space around $\Theta^*$ where the value of the integrand of the razor is significant.^ The scale for determining whether $\det\tilde{J}^{-1}_{\mu\nu}$ is large or small is set by the Fisher Information on the surface whose determinant defines the volume element. Consequently the term $(\det J/\det\tilde{J})^{1/2}$ can be understood as measuring the naturalness of the model in the sense discussed earlier in the paper, since it involves a preference for model families with distributions concentrated around the true distribution. Another way of understanding this point is to observe from the derivation of the integration measure in the razor that given a fixed number of data points $N$, the volume of indistinguishability around $\Theta^*$ is proportional to $(\det J)^{-1/2}$. So the factor $(\det J/\det\tilde{J})^{1/2}$ is essentially proportional to the ratio $V_{large}/V_{indist}$, the ratio of the volume where the integrand of the razor is large to the volume of indistinguishability introduced earlier. Essentially, a model is better (more natural) if there are many distinguishable models that are close to the true distribution. The term $\ln[(2\pi)^{d/2}/V]$ can be understood as a preference for models that have a smaller invariant volume in the space of distributions and hence are more constrained. The terms proportional to $1/N$ are less easy to interpret. They involve higher derivatives of the metric on the parameter manifold and of the relative entropy distances between points on the manifold and the true distribution. This suggests that these terms essentially penalize high curvatures of the model manifold, but it is hard to extract such an interpretation in terms of components of the curvature tensor on the manifold.

^If we fix a fraction $f < 1$ where $f$ is close to 1, the integrand of the razor will be greater than $f$ times the peak value in an elliptical region around the maximum.

A consistent estimator of the razor can be used to implement parsimonious and consistent inference. Suppose we are comparing two model families $A$ and $B$. We evaluate the razor of each family and pick the one with the larger razor. To evaluate the behaviour of the razor we have to consider several different cases. First suppose that $A$ is $d$-dimensional and $B$ is a more accurate $k$-dimensional model with $k > d$. By saying that $B$ is more accurate we mean that $D(t\|\Theta_B^*) < D(t\|\Theta_A^*)$. We expect that for small $N$ the terms proportional to $\ln N$ will dominate and that for large $N$ the terms proportional to $N$ will dominate. We can compute the crossover number of events beyond which accuracy is favoured over simplicity. Ignoring the terms of $O(1)$ let us ask how large $N$ must be so that $R_N(B) > R_N(A)$. The answer is easily seen to be the solution to the equation:

$$\frac{k-d}{2}\,\frac{\ln N}{N} < \Delta D + O(1/N) = D(t\|\Theta_A^*) - D(t\|\Theta_B^*) + O(1/N) \qquad (44)$$

Up to terms of $O(1/N)$, this is the expected crossover point between A and B if
the inference used the Minimum Description Length principle based on stochastic
complexity as introduced by Rissanen ([14], [15]). The $O(1)$ terms in the razor are
important for small $N$ and for cases where the models in question have parameter
spaces of equal dimension. In that case, ignoring terms of $O(1/N)$ in the razor, the
crossover point where $R_N(B) > R_N(A)$ is given by:

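$$N = \frac{1}{D(t\|\Theta^*_A) - D(t\|\Theta^*_B)} \left( \ln\frac{V_B}{V_A} + \frac{1}{2}\ln\frac{\det J_A/\det \tilde{J}_A}{\det J_B/\det \tilde{J}_B} \right) \tag{45}$$

Here the subscripts A and B indicate the family in which each quantity is evaluated.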

The terms within the parentheses have been interpreted above in terms of the relative 
volume of the parameter space that is close to the true distribution (in other words, as 
a measure of robustness). So we see that if A is a more robust model than B then the 
crossover number of events is greater. The crossover point is inversely proportional 
to the difference in relative entropy distances between the true distribution and the
best model and this too makes good intuitive sense. 
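
As a rough numerical illustration of Equation 44, the crossover number of events can be found by solving $(k-d)\ln N/(2N) = \Delta D$ for $N$. The sketch below is not part of the original analysis: the function name `crossover_n`, the bisection bracket, and the sample values of $d$, $k$ and $\Delta D$ are assumptions chosen for illustration, and the $O(1/N)$ corrections are ignored.

```python
import math


def crossover_n(d, k, delta_d, n_max=1e12):
    """Approximate N at which (k - d)/2 * ln(N)/N falls to delta_d.

    Illustrative helper (hypothetical name). ln(N)/N is decreasing for N > e,
    so bisection on [e, n_max] locates the point where the complexity penalty
    per event drops below the accuracy gain delta_d.
    """
    if k <= d or delta_d <= 0:
        raise ValueError("need k > d and delta_d > 0")

    def gap(n):
        return 0.5 * (k - d) * math.log(n) / n - delta_d

    lo, hi = math.e, n_max
    if gap(lo) <= 0:
        return lo  # accuracy already wins at the left end of the bracket
    if gap(hi) > 0:
        raise ValueError("no crossover below n_max")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


# Assumed example: B has three more parameters than A and is closer to the
# true distribution by 0.01 nats of relative entropy.
n_star = crossover_n(d=2, k=5, delta_d=0.01)
```

With these assumed values the crossover falls at roughly a thousand events; halving $\Delta D$ more than doubles it, because of the logarithm.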

Further examinations of this sort show that the razor has a preference for simple
models, but is consistent in that the most accurate model in relative entropy sense
will dominate for sufficiently large $N$. Inference can be carried out with a countably
large set of candidate families by placing the families in a list and examining the razor
of the first $N$ families when $N$ events are provided. It is clear that this procedure
is then guaranteed to be asymptotically consistent while remaining parsimonious at
each stage of the inference. Indeed the model which is closest to the true in relative
entropy sense will eventually be chosen while simpler models may be preferred for
finite $N$.
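
The list-based procedure can be sketched as follows, reading it as truncating the list of candidate families to its first $N$ entries when $N$ events are available. The interface is hypothetical: each family is represented by a callable returning an estimate of $\chi = -\ln(\text{razor})$ from the data, and the name `select_family` is an assumption made for this example.

```python
def select_family(chi_estimators, data):
    """Pick the family with the smallest estimated chi among the first N.

    chi_estimators: ordered list of callables, one per candidate family
    (hypothetical interface); each maps the data to an estimate of
    chi = -ln(razor). The smallest chi corresponds to the largest razor.
    """
    n = len(data)
    if n == 0 or not chi_estimators:
        raise ValueError("need at least one outcome and one family")
    candidates = chi_estimators[:n]  # only the first N families are examined
    scores = [chi(data) for chi in candidates]
    best = min(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]
```

A family deep in the list is therefore never considered until enough events have arrived, which is what keeps each stage of the inference finite.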



6 Various Meanings of The Results 

6.1 Relationship to The Asymptotics of Bayes Rule 

The analysis of the previous section has shown that the razor is parsimonious, yet 
consistent in its preferences. Unfortunately, in order to compute the razor one must 
already know the true distribution. It is certainly an index measuring the simplicity 
and accuracy of a model, but actual inference procedures must devise schemes to 
estimate the value of the razor from data. A good estimator of the razor will be 
guaranteed to pick accurate, yet simple models. So how do we estimate the razor of 
a model? 

Given the Bayesian derivation of the razor a natural candidate is the Bayesian 
posterior probability of a parametric model given the data. As discussed before, this 
probability is given by: 



$$R_E(A) = \frac{\int d\mu(\Theta)\,\sqrt{\det J}\,\exp(\ln \Pr(E|\Theta))}{\int d\mu(\Theta)\,\sqrt{\det J}} \tag{46}$$

We have used a Jeffreys prior, which is the uniform prior on the space of probability
distributions as discussed in previous sections. The relationship between the razor
and the estimator in Equation 46 can be analyzed in various ways. The simplest
relationship arises because the exponential is a convex function, so that Jensen's
inequality gives us the following bound on the expectation value of $R_E(A)$ in the true
distribution $t$:

$$\langle R_E(A) \rangle_t \geq \frac{\int d\mu(\Theta)\,\sqrt{\det J}\,\exp \langle \ln \Pr(E|\Theta) \rangle_t}{\int d\mu(\Theta)\,\sqrt{\det J}} = R_N(A)\, e^{-N h(t)} \tag{47}$$

where $h$ is the differential entropy of the true distribution. So the razor times the
exponential of $-N$ times the entropy of the true is a lower bound on the expected value of the
Bayesian posterior.
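
The bound in Equation 47 can be checked numerically in a case where both sides have closed forms. The sketch below assumes a Bernoulli model family and a Bernoulli true distribution with parameter `p`, so the Jeffreys prior is Beta(1/2, 1/2) and $h(t)$ is the discrete (Shannon) entropy rather than a differential entropy. The function names are hypothetical and the example is an illustration, not the paper's own calculation.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def expected_evidence(n, p):
    """<R_E(A)>_t for the Bernoulli family under the Jeffreys prior.

    A sequence with k ones has evidence B(k + 1/2, n - k + 1/2) / B(1/2, 1/2);
    it is averaged over the n-outcome sequences drawn from Bernoulli(p).
    """
    log_norm = log_beta(0.5, 0.5)
    total = 0.0
    for k in range(n + 1):
        log_count = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        log_prob = log_count + k * math.log(p) + (n - k) * math.log(1 - p)
        log_evidence = log_beta(k + 0.5, n - k + 0.5) - log_norm
        total += math.exp(log_prob + log_evidence)
    return total


def razor_times_entropy_factor(n, p):
    """R_N(A) * exp(-n h(t)) for the same family and true distribution.

    exp(-n D(t||theta) - n h(t)) = theta^(n p) (1 - theta)^(n (1 - p)), so the
    Jeffreys-weighted integral is again a ratio of Beta functions.
    """
    return math.exp(log_beta(n * p + 0.5, n * (1 - p) + 0.5)
                    - log_beta(0.5, 0.5))
```

For any `n` and `0 < p < 1`, `expected_evidence(n, p)` should be at least `razor_times_entropy_factor(n, p)`, as Jensen's inequality requires.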

A sharper analysis may be carried out to show that under certain regularity
assumptions the razor reflects the typical behaviour of $\chi_E(A) = -\ln(R_E(A)) - N h(t)$.
The first assumption is that $\ln \Pr(E|\Theta)$ is a smooth function of $\Theta$ for every set of
outcomes $E = \{e_1, \ldots, e_N\}$ in the outcome sample space. (In fact, this
assumption can be weakened to smoothness only in a neighbourhood of $\Theta^*$.) Using this
premise, and the already assumed smoothness of the Fisher Information matrix $J_{ij}(\Theta)$,
we can expand the exponent in Equation 46 around the maximum likelihood
parameter $\hat{\Theta} = \arg\max_\Theta \ln \Pr(E|\Theta)$ to obtain:



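[Equation 48: the expansion of $R_E(A)$ in powers of $\Theta - \hat{\Theta}$, with coefficients $I_{\mu_1 \cdots \mu_i}$ and $F_{\mu_1 \cdots \mu_i}$]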



In this expression $F(\Theta^*)$ and $F_{\mu_1 \cdots \mu_i}$ are the same as in the corresponding
expansion of the razor, and we have defined
$I_{\mu_1 \cdots \mu_i} = -\nabla_{\mu_1} \cdots \nabla_{\mu_i} \ln \Pr(E|\Theta)|_{\hat{\Theta}}/N$. We will only consider models in which
the empirical Fisher Information $I_{\mu\nu}$, related to relative entropy distances between
model distributions and the true, is nonsingular everywhere for every set of outcomes
$E$. This is a condition ensuring that nearby parameters index sufficiently different
models of the true distribution. By imitating the analysis of the razor (under the
same assumptions as those listed for that analysis), we find that:



[Equation 49: the resulting expression for $R_E(A)$ in terms of $\ln \Pr(E|\hat{\Theta})$, $F(\hat{\Theta})$, $\det I$ and the operator $G(\nabla_h)$]



where $G$ is the same as in the corresponding equations for the razor, with $I$ substituted
for every $\tilde{J}$. Defining $\chi_E(A) = -\ln R_E(A) - N h(t)$ we find that to $O(1/N)$:

$$\chi_E(A) = N\left(\frac{-\ln \Pr(E|\hat{\Theta})}{N} - h(t)\right) + \frac{d}{2}\ln N - \frac{1}{2}\ln\left(\frac{\det J_{ij}(\hat{\Theta})}{\det I_{\mu\nu}|_{\hat{\Theta}}}\right) - \ln\left[\frac{(2\pi)^{d/2}}{V}\right] + O(1/N) \tag{50}$$

Terms proportional to positive powers of $1/N$ may be computed as before, but we
will not evaluate them explicitly here. It suffices to note that all such terms
in Equation 50 are identical to the corresponding terms in Equation 43 with $I$
substituted for $\tilde{J}$.

We will now prove a theorem showing that any finite collection of terms in $\chi_E(A)$
converges with high probability to the corresponding terms in $\chi_N(A) = -\ln R_N(A)$
in the limit of a large number
of samples. Throughout the discussion we will assume consistency of the maximum
likelihood estimator in the following sense. Let $U$ be any neighbourhood of
$\Theta^* = \arg\min_\Theta D(t\|\Theta)$ on the parameter manifold and let $E = \{e_1, \ldots, e_N\}$ be any set of
outcomes drawn independently from $t$. Then, for any $0 < \delta < 1$, we shall assume
that the maximum likelihood estimator $\hat{\Theta} = \arg\max_\Theta \ln \Pr(E|\Theta)$ falls inside $U$ with
probability greater than $1 - \delta$ for sufficiently large $N$. We also require that the log
likelihood of a single outcome $e_i$, $\ln \Pr(e_i|\Theta)$, considered as a family of functions of $\Theta$
indexed by the outcomes $e_i$, is an equicontinuous family at $\Theta^*$ ([10]). In other words,
given any $\epsilon > 0$, there is a neighbourhood $M$ of $\Theta^*$ such that for every $e_i$ and
$\Theta \in M$, $|\ln \Pr(e_i|\Theta) - \ln \Pr(e_i|\Theta^*)| < \epsilon$. Finally, we will require that all derivatives
with respect to $\Theta$ of the log likelihood of a single outcome should be equicontinuous
at $\Theta^*$ in the same sense.



Lemma 6.1 Let $N$ be the number of iid outcomes $E = \{e_1, \ldots, e_N\}$ arising from a
distribution $t$ and take $\epsilon > 0$ and $0 < \delta < 1$. If the maximum likelihood estimator
is consistent, $\Pr(|J_{ij}(\hat{\Theta}) - J_{ij}(\Theta^*)| > \epsilon) < \delta$ for sufficiently large $N$. (See above
for definitions of $\Theta^*$ and $\hat{\Theta}$.) Furthermore, if the log likelihood of a single outcome is
equicontinuous at $\Theta^*$ (see definition above) then
$\Pr(|-\ln \Pr(E|\hat{\Theta})/N - D(t\|\Theta^*) - h(t)| > \epsilon) < \delta$ for sufficiently large $N$. Finally, if the derivatives with respect to $\Theta$
of the log likelihood of a single outcome are equicontinuous at $\Theta^*$, then
$\Pr(|I_{\mu_1 \cdots \mu_i}|_{\hat{\Theta}} - \tilde{J}_{\mu_1 \cdots \mu_i}| > \epsilon) < \delta$ for sufficiently large $N$.

Proof: We have assumed that the Fisher Information matrix $J_{ij}(\Theta)$ is a smooth
matrix valued function on the parameter manifold. By consistency of the maximum
likelihood estimator, $\hat{\Theta} \to \Theta^*$ in probability. Since the entries of the matrix $J_{ij}$ are
continuous functions of $\Theta$ we can conclude that $J_{ij}(\hat{\Theta}) \to J_{ij}(\Theta^*)$ in probability also.
This proves the first claim. To prove the second and third claims consider any
function of the form $F_N(E, \Theta) = (1/N) \sum_{i=1}^N F_1(e_i, \Theta)$ where $F_1(e_i, \Theta)$ is an equicontinuous
family of functions of $\Theta$ at $\Theta^*$. We want to show that $|F_N(E, \hat{\Theta}) - E_t[F_1(e_i, \Theta^*)]|$
approaches zero in probability where the expectation is taken in $t$, the true distribution.
To this end we write:

$$|F_N(E, \hat{\Theta}) - E_t[F_1(e_i, \Theta^*)]| \leq |F_N(E, \Theta^*) - E_t[F_1(e_i, \Theta^*)]| + |F_N(E, \hat{\Theta}) - F_N(E, \Theta^*)| \tag{51}$$

The first term on the right hand side is the absolute value of the difference between the
sample average of an iid random variable and its mean value. This approaches zero
almost surely by the strong law of large numbers and so for sufficiently large $N$ the first
term is less than $\epsilon/2$ with probability greater than $1 - \delta/2$ for any $\epsilon > 0$ and $0 < \delta < 1$.
In order to show that the second term on the right hand side converges to zero in
probability, note that since $F_1(e_i, \Theta)$ is equicontinuous at $\Theta^*$, given any $\epsilon > 0$ there
is a neighbourhood $U$ of $\Theta^*$ within which $|F_1(e_i, \Theta) - F_1(e_i, \Theta^*)| < \epsilon/2$ for any $e_i$ and
$\Theta \in U$. Therefore, for any set of outcomes $E$ and $\Theta \in U$,
$|F_N(E, \Theta) - F_N(E, \Theta^*)| = (1/N)|\sum_{i=1}^N (F_1(e_i, \Theta) - F_1(e_i, \Theta^*))| \leq (1/N)\sum_{i=1}^N |F_1(e_i, \Theta) - F_1(e_i, \Theta^*)| < \epsilon/2$. By
consistency of the maximum likelihood estimator, $\hat{\Theta} \in U$ with probability greater
than $1 - \delta/2$ for sufficiently large $N$. Consequently,
$\Pr(|F_N(E, \hat{\Theta}) - F_N(E, \Theta^*)| > \epsilon/2) < \delta/2$ for sufficiently large $N$. Putting the bounds on the two terms on the right
hand side of Equation 51 together, and using the union of events bound, we see that
for sufficiently large $N$:

$$\Pr(|F_N(E, \hat{\Theta}) - E_t[F_1(e_i, \Theta^*)]| > \epsilon) < \delta \tag{52}$$

To complete the proof we can observe that by assumption $\ln \Pr(e_i|\Theta)$ and its
derivatives with respect to $\Theta$ are equicontinuous at $\Theta^*$ and that $-\ln \Pr(E|\Theta)/N$ and the
various $I_{\mu_1 \cdots \mu_i}$ are therefore examples of the functions $F_N$. Furthermore,
$E_t[-\ln \Pr(e_i|\Theta^*)] = D(t\|\Theta^*) + h(t)$ and $E_t[I_{\mu_1 \cdots \mu_i}|_{\Theta^*}] = \tilde{J}_{\mu_1 \cdots \mu_i}$ under the assumption that derivatives with
respect to $\Theta$ commute with expectations with respect to $t$. On applying Equation 52
to these observations, the lemma is proved. □
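
The second claim of Lemma 6.1 can be illustrated by simulation in a case where the true distribution lies outside the model family. The sketch below assumes a unit-variance Gaussian family $N(\theta, 1)$ and a true distribution $t$ that is Uniform(-1, 1); these choices, the seed, and the sample sizes are assumptions made for the example. Here $\Theta^* = 0$ and $D(t\|\Theta^*) + h(t) = \tfrac{1}{2}\ln 2\pi + \tfrac{1}{6}$, since $E_t[x^2] = 1/3$.

```python
import math
import random


def mean_neg_log_likelihood(samples):
    """-ln Pr(E | theta_hat) / N for the unit-variance Gaussian family."""
    n = len(samples)
    theta_hat = sum(samples) / n  # maximum likelihood estimate of the mean
    mean_sq_dev = sum((x - theta_hat) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi) + 0.5 * mean_sq_dev


# Limit predicted by the lemma: D(t || theta*) + h(t) = E_t[-ln p(x | 0)].
limit = 0.5 * math.log(2 * math.pi) + 1.0 / 6.0

rng = random.Random(0)  # fixed seed so the run is reproducible
errors = {}
for n in (100, 10_000, 200_000):
    samples = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    errors[n] = abs(mean_neg_log_likelihood(samples) - limit)
```

The entries of `errors` should shrink roughly like $1/\sqrt{N}$, although any single run fluctuates.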

Note that Lemma 6.1 shows that the two leading terms in the asymptotic
expansions of $\chi_E(A)$ and $\chi_N(A)$ approach each other with high probability. We will
now obtain control over the subleading terms in these expansions. Define $c_k$ to be
the coefficient of $1/N^k$ in the asymptotic expansion of $\chi_E(A)$ so that we can write
$\chi_E(A) = -\ln R_E(A) - N h(t) = N c_{-1} + (d/2) \ln N + c_0 + (1/N) c_1 + (1/N^2) c_2 + \cdots$. Let $d_k$ be the
corresponding coefficients of $1/N^k$ in the expansion of $\chi_N(A)$. The $d_k$ are identical
to the $c_k$ with each $I$ replaced by $\tilde{J}$ and $\hat{\Theta}$ replaced by $\Theta^*$. We can show that the $c_k$ approach the $d_k$ with
high probability.



Lemma 6.2 Let the assumptions made in Lemma 6.1 hold and let $\epsilon > 0$ and
$0 < \delta < 1$. Then for every integer $k \geq -1$, there is an $N$ such that $\Pr(|c_k - d_k| > \epsilon) < \delta$.

Proof: The coefficient $c_{-1} = -\ln \Pr(E|\hat{\Theta})/N - h(t)$ has been shown to approach
$d_{-1} = D(t\|\Theta^*)$ in probability as an immediate consequence of Lemma 6.1. Next
we consider $c_k$ for $k \geq 1$. Every term in every such $c_k$ can be shown to be a finite
sum over finite products of constants and random variables of the form $I_{\mu_1 \cdots \mu_i}$ and
$I^{\mu\nu}$. We have already seen that $I_{\mu_1 \cdots \mu_i}|_{\hat{\Theta}} \to \tilde{J}_{\mu_1 \cdots \mu_i}$ in probability. The $I^{\mu\nu}|_{\hat{\Theta}}$ are the
entries of the inverse of the empirical Fisher Information $I_{\mu\nu}|_{\hat{\Theta}}$. Since the inverse is
a continuous function, and since $I_{\mu\nu} \to \tilde{J}_{\mu\nu}$ in probability, $I^{\mu\nu} \to \tilde{J}^{\mu\nu}$ in probability
also. As noted before, $d_k$ is identical to $c_k$ with each $I$ replaced by $\tilde{J}$. Since $c_k$
is a finite sum of finite products of random variables $I$ that converge individually in
probability to the $\tilde{J}$, we can conclude that $c_k \to d_k$ in probability. Finally, we consider
$c_0 - d_0 = (-1/2)\ln(\det J_{ij}(\hat{\Theta})/\det I_{\mu\nu}) - (-1/2)\ln(\det J_{ij}(\Theta^*)/\det \tilde{J}_{\mu\nu})$. We have
shown that $J_{ij}(\hat{\Theta}) \to J_{ij}(\Theta^*)$ and $I_{\mu\nu} \to \tilde{J}_{\mu\nu}$ in probability. Since the determinant
and the logarithm are continuous functions we conclude that $c_0 - d_0 \to 0$ in probability.
□

We have just shown that each term in the asymptotic expansion of $\chi_E(A) = -\ln R_E(A) - N h(t)$
approaches the corresponding term in $\chi_N(A)$ with high probability for sufficiently
large $N$. As an easy corollary of this lemma we obtain the following theorem:

Theorem 6.1 Let the conditions necessary for Lemmas 6.1 and 6.2 hold and take
$k' > k + 1 > 0$ to be integers. Then let $T_E(A, k, k')$ consist of the terms in the
asymptotic expansion of $\chi_E(A)$ that are of orders $1/N^k$ to $1/N^{k'}$. For example,
$T_E(A, 4, 6) = (1/N^4) c_4 + (1/N^5) c_5 + (1/N^6) c_6$, using the coefficients $c_k$ defined above.
Let $T_N(A, k, k')$ be the corresponding terms in the asymptotic expansion of $\chi_N(A)$.
Then for any $k$ and $k'$, and for any $\epsilon > 0$ and $0 < \delta < 1$,
$\Pr(N^k |T_E(A, k, k') - T_N(A, k, k')| > \epsilon) < \delta$ for sufficiently large $N$.

Proof: By definition of $T_E$ and $T_N$,
$N^k |T_E(A, k, k') - T_N(A, k, k')| = |\sum_{i=k}^{k'} (c_i - d_i)/N^{i-k}| \leq \sum_{i=k}^{k'} |c_i - d_i|/N^{i-k}$.
By Lemma 6.2, $|c_i - d_i| \to 0$ in probability. Therefore,
$N^k |T_E(A, k, k') - T_N(A, k, k')|$ is a nonnegative number that is upper bounded by a finite
sum of random variables that individually converge to zero in probability. Since the
sum is finite we can conclude that $N^k |T_E(A, k, k') - T_N(A, k, k')|$ also converges to
zero in probability, thereby proving the theorem. □

Note that the multiplication by $N^k$ ensures that the convergence is not simply due
to the fact that every partial sum $T_E(A, k, k')$ is individually decreasing to zero as the
number of outcomes increases. Any finite series of terms in the asymptotic expansion
of the logarithm of the Bayesian posterior probability converges in probability to the
corresponding series of terms in the expansion of the razor. Theorem 6.1 precisely
characterizes the sense in which the razor of a model reflects the typical asymptotic
behaviour of the Bayesian posterior probability of a model given the sample outcomes.

We can also compare the razor to the expected behaviour of $R_E(A)$ in the true
distribution $t$. Clarke and Barron have analyzed the expected asymptotics of the
logarithm of $t(E)/R_E(A)$ where $t$ is the true distribution, under the assumption that
$t$ belongs to the parametric family $A$. With certain small modifications of their
hypotheses their results can be extended to the situation studied in this paper where
the true density need not be a member of the family under consideration. The first
modification is that the expectation values evaluated in Condition 1 of Clarke and Barron
should be taken in the true distribution $t$, which need not be a member of the parametric family.
Secondly, the differentiability requirements in Conditions 1 and 2 should be applied
at $\Theta^*$, which minimizes $D(t\|\Theta)$. (Clarke and Barron apply these requirements at the
true parameter value since they assume that $t$ is in the family.) Finally, Condition 3
is changed to require that the posterior distribution of $\Theta$ given $X^n$ concentrates on a
neighbourhood of $\Theta^*$ except for $X^n$ in a set of probability $o(1/\log N)$. Under these
slightly modified hypotheses it is easy to rework the analysis of Clarke and Barron to
demonstrate the following asymptotics for the expected value of $R_E(A)$:

$$\langle -\ln R_E(A) \rangle_t - N h(t) = N D(t\|\Theta^*) + \frac{d}{2}\ln\left(\frac{N}{2\pi e}\right) - \frac{1}{2}\ln(\det J/\det \tilde{J}) - \ln(1/V) \tag{53}$$

We see that as $N \to \infty$, $\langle -\ln R_E(A) \rangle_t - N h(t)$ is equal to $\chi_N(A)$, the negative
logarithm of the razor, up to a constant
term $d/2$. More careful analysis shows that this term arises from the statistical
fluctuations of the maximum likelihood estimator of $\Theta^*$ around $\Theta^*$. It is worth noting
that while terms of $O(1)$ and larger in $\langle \ln R_E(A) \rangle_t$ depend at most on the
measure (prior distribution) assigned to the parameter manifold, the terms of $O(1/N)$
depend on the geometry via the connection coefficients in the covariant derivatives.
For that reason, the $O(1/N)$ terms are the leading probes of the effects that the
geometry of the space of distributions has on statistical inference in a Bayesian setting
and so it would be very interesting to analyze them. Normally we do not include
these terms because we are interested in asymptotics, but when the amount of data is
small, these correction terms are potentially important in implementing parsimonious
density estimation. Unfortunately it turns out to be difficult to obtain sufficiently fine
control over the probabilities of events to extend the expected asymptotics beyond
the $O(1)$ terms and so further analysis will be left to future publications.
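
The constant offset of $d/2$ can be read off by direct subtraction. The lines below are a worked check rather than part of the original derivation; they assume that the razor expansion has the same form as Equation 50, with $-\ln \Pr(E|\hat{\Theta})/N - h(t)$ replaced by $D(t\|\Theta^*)$, $\hat{\Theta}$ by $\Theta^*$, and $I$ by $\tilde{J}$.

```latex
\begin{aligned}
\chi_N(A) &= N D(t\|\Theta^*) + \frac{d}{2}\ln N
            - \frac{1}{2}\ln\frac{\det J}{\det \tilde{J}}
            - \frac{d}{2}\ln 2\pi + \ln V + O(1/N), \\
\langle -\ln R_E(A) \rangle_t - N h(t)
          &= N D(t\|\Theta^*) + \frac{d}{2}\ln N - \frac{d}{2}\ln 2\pi - \frac{d}{2}
            - \frac{1}{2}\ln\frac{\det J}{\det \tilde{J}} + \ln V, \\
\langle -\ln R_E(A) \rangle_t - N h(t) - \chi_N(A) &= -\frac{d}{2} + O(1/N).
\end{aligned}
```

The second line is Equation 53 with $\ln(N/2\pi e)$ and $\ln(1/V)$ expanded, so the only term without a counterpart in $\chi_N(A)$ is the $-d/2$ coming from the factor of $e$.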



6.2 Relationship to The Minimum Description Length Principle

In the previous section we have seen that the Bayesian conditional probability of a
model given the data is an estimator of the razor. In this section we will consider
the relationship of the razor to the Minimum Description Length principle and the
stochastic complexity inference criterion advocated by Rissanen. The MDL approach
to parametric inference was pioneered by Akaike, who suggested choosing the model
maximizing $\ln \Pr(E|\hat{\Theta}) - d$ with $d$ the dimension of the model and $\hat{\Theta}$ the maximum
likelihood estimator. Subsequently, Schwarz studied the maximization of
the Bayesian posterior likelihood for densities in the Koopman-Darmois family and
found that the Bayesian decision procedure amounted to choosing the density that
maximized $\ln \Pr(E|\hat{\Theta}) - (1/2) d \log N$. Rissanen placed this criterion on a solid
footing by showing that the model attaining $\min_{\Theta, d}\{-\log \Pr(E|\Theta) + (1/2) d \log N\}$
gives the most efficient coding rate possible of the observed sequence amongst all
universal codes ([14], [15]). In this paper we have shown that the razor of a model,
which reflects the typical asymptotics of the logarithm of the Bayesian posterior, has
a geometric interpretation as an index of the simplicity and accuracy of a given model
as a description of some true distribution. In the previous section we have shown that
the logarithm of the Bayesian posterior can be expanded as:

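$$\chi_E(A) = -\ln R_E(A) = -\ln \Pr(E|\hat{\Theta}) + \frac{d}{2}\ln N - \frac{1}{2}\ln\left(\frac{\det J_{ij}(\hat{\Theta})}{\det I_{\mu\nu}|_{\hat{\Theta}}}\right) - \ln\left[\frac{(2\pi)^{d/2}}{V}\right] + O(1/N) \tag{54}$$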



with $\hat{\Theta}$ the maximum likelihood parameter and
$I_{\mu_1 \cdots \mu_i} = -(1/N) \nabla_{\mu_1} \cdots \nabla_{\mu_i} \ln \Pr(E|\Theta)|_{\hat{\Theta}}$.
The term of $O(1/N)$ that we have not explicitly written is the same as the
corresponding term of the logarithm of the razor (Equation 43) with every $\tilde{J}$ replaced
by $I$. We recognize the first two terms in this expansion to be exactly the stochastic
complexity advocated by Rissanen as a measure of the complexity of a string
relative to a particular model family. We have given a geometric meaning to the term
$(d/2) \ln N$ in terms of a measurement of the rate of shrinkage of the volume in
parameter space in which the likelihood of the data is significant. Given our results
concerning the razor and the typical asymptotics of $\chi_E(A)$, this strongly suggests
that the definition of stochastic complexity should be extended to include the
subleading terms in Equation 54. Indeed, Rissanen has considered such an extension
based on the work of Clarke and Barron and finds that the terms of $O(1)$ in the
expected value of Equation 54 remove the redundancy in the class of codes that meet
the bound on the expected coding rate represented by the earlier definition of
stochastic complexity ([13]). Essentially, in coding short sequences we are less interested in
the coding rate and more interested in the actual code length. This suggests that
for small $N$ the $O(1/N)$ terms can be important in determining the ideal expected
codelength, but it remains difficult to obtain sufficient control over the probabilities
of rare events to extend Rissanen's result to this order. As mentioned earlier, the
metric on the parameter manifold affects the terms of $O(1/N)$ and therefore these
corrections would be geometric in nature.
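
The size of the subleading terms can be seen in a family where $-\ln R_E(A)$ is available exactly. The sketch below assumes the one-parameter Bernoulli family, for which the Jeffreys prior is Beta(1/2, 1/2), the volume is $V = \pi$, and the empirical $I$ at $\hat{\Theta}$ coincides with the Fisher Information $J(\hat{\Theta})$, so that the determinant ratio in Equation 54 drops out. The function names are hypothetical, and the example requires $0 < k < N$ ones among the $N$ outcomes.

```python
import math


def exact_chi(k, n):
    """-ln R_E for the Bernoulli family with the Jeffreys prior Beta(1/2, 1/2)."""
    log_beta = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
                - math.lgamma(n + 1.0))
    return math.log(math.pi) - log_beta  # B(1/2, 1/2) = pi


def neg_log_max_likelihood(k, n):
    theta_hat = k / n
    return -(k * math.log(theta_hat) + (n - k) * math.log(1.0 - theta_hat))


def two_term_complexity(k, n, d=1):
    """-ln Pr(E | theta_hat) + (d/2) ln N: the first two terms of Equation 54."""
    return neg_log_max_likelihood(k, n) + 0.5 * d * math.log(n)


def extended_complexity(k, n, d=1):
    """Adds the O(1) term -ln[(2 pi)^(d/2) / V] with V = pi for this family."""
    volume = math.pi
    return two_term_complexity(k, n, d) - math.log((2 * math.pi) ** (d / 2) / volume)
```

For 300 ones in 1000 outcomes the extended expression agrees with the exact value to well under $10^{-2}$ nats, while the two-term expression is off by about $\tfrac{1}{2}\ln(\pi/2) \approx 0.23$ nats, a difference that matters most when comparing families of equal dimension.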

Another approach to stochastic complexity and learning that is related to the
razor and its estimators has been taken recently by Yamanishi. Let $\mathcal{H}^d = \{f_\Theta\}$
be a hypothesis class indexed by d-dimensional real vectors $\Theta$. Then, in a general
decision theoretic setting Yamanishi defines the Extended Stochastic Complexity of
a model $A$ relative to the data $E$, the class $\mathcal{H}^d$, and a loss function $L$ to be:

$$I(D^N : \mathcal{H}^d) = -\ln \int d\Theta\, \pi(\Theta)\, e^{-\lambda \sum_{i=1}^N L(D_i : f_\Theta)} \tag{55}$$

where $\lambda > 0$ and $\pi(\Theta)$ is a prior. Following the work described in this paper he
defines the razor index of $A$ relative to $L$, $\mathcal{H}^d$ and a given true distribution $p$ to be:

$$I_N(p : \mathcal{H}^d) = -\ln \int d\Theta\, \pi(\Theta)\, e^{-\lambda E_p\left[\sum_{i=1}^N L(D_i : f_\Theta)\right]} \tag{56}$$

For the case of a loss function $L(E, f_\Theta) = -\ln \Pr(E|\Theta)$ and $\lambda = 1$, Equations 56 and 55 reduce
to the quantities $\chi_N(A)$ and $\chi_E(A)$, which are the negative logarithm of the razor and its
estimator. Yamanishi shows that if the class of functions $\mathcal{H} = \{f_\Theta(X)\}$ has finite
Vapnik-Chervonenkis dimension, then $(1/N)|I(D^N : \mathcal{H}^d) - I_N(p : \mathcal{H}^d)| < \epsilon$ with high
probability for sufficiently large $N$. For the case of a logarithmic loss function this
result applies to the razor and its estimator as defined in this paper.
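
A quantity of the form of Equation 55 can be evaluated numerically for a one-dimensional hypothesis class. The sketch below makes several assumptions for illustration: the complexity is taken to be minus the logarithm of the prior-weighted integral (so that it reduces to $\chi_E$ for logarithmic loss and $\lambda = 1$), the parameter ranges over $(0, 1)$, the integral is approximated by a midpoint rule, and all function names are hypothetical.

```python
import math


def extended_stochastic_complexity(data, loss, prior, lam=1.0, grid=20000):
    """-ln of the integral over theta in (0, 1) of prior * exp(-lam * total loss).

    Midpoint rule on a uniform grid, with a log-sum-exp shift for stability.
    """
    step = 1.0 / grid
    log_terms = []
    for j in range(grid):
        theta = (j + 0.5) * step
        total_loss = sum(loss(x, theta) for x in data)
        log_terms.append(math.log(prior(theta)) - lam * total_loss)
    shift = max(log_terms)
    integral = sum(math.exp(v - shift) for v in log_terms) * step
    return -(shift + math.log(integral))


def log_loss(x, theta):
    """Logarithmic loss of a Bernoulli(theta) hypothesis on one binary outcome."""
    return -math.log(theta if x == 1 else 1.0 - theta)


def uniform_prior(theta):
    return 1.0


data = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
esc = extended_stochastic_complexity(data, log_loss, uniform_prior)
```

With logarithmic loss, $\lambda = 1$ and a uniform prior, the integral is the Beta function $B(4, 8) = 1/1320$, so `esc` should be close to $\ln 1320$. Other loss functions can be passed in the same way.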



6.3 "Physical" Interpretation of The Razor 

There is an interesting "physical" interpretation of the results regarding the razor and
the asymptotics of Bayes Rule which identifies the terms in the razor with energies,
temperatures and entropies in the physical sense. Many techniques for model
estimation involve picking a model that minimizes a loss function $L_\Theta(E)$ where $E$ is the
data, $\Theta$ are the parameters and $L$ is some empirical loss calculated from them. The
typical behaviour of the loss function is that it grows as the amount of data grows. In the
case of maximum likelihood model estimation we take $L_\Theta(E) = N(-\ln \Pr(E|\Theta)/N)$,
where we expect $-\ln \Pr(E|\Theta)/N$ to attain a finite positive limit as $N \to \infty$ under
suitable conditions on the process generating the data. In this case we can make
an analogy with physical systems: $N$ is like the inverse temperature and the limit
of $-\ln \Pr(E|\Theta)/N$ is like the energy of the system. Maximum likelihood estimation
corresponds to minimization of the energy and in physical terms will be adequate to
find the equilibrium of the system at zero temperature (infinite $N$). On the other
hand we know that at finite temperature (finite $N$) the physical state of the system is
determined by minimizing the free energy $E - TS$ where $T$ is the temperature and $S$ is
the entropy. The entropy counts the volume of configurations that have energy $E$ and
accounts for the fluctuations inherent in a finite temperature system. We have seen
in the earlier sections that terms in the razor and in the asymptotics of Bayes Rule
that account for the simplicity of a model arise exactly from such factors of volume.
Indeed, the subleading terms in the extended stochastic complexity advocated above
can be identified with a "physical" entropy associated with the statistical fluctuations
that prevent us from knowing the "true" parameters in estimation problems.
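
The analogy can be made concrete with a one-dimensional toy integral of the form $\int d\theta\, e^{-N E(\theta)}$. The sketch below is only an illustration: the energy function, the integration range, and the grid size are assumptions. It compares minus the logarithm of the integral with the "energy" term $N E_{min}$ alone and with the energy plus the volume ("entropy") term $\tfrac{1}{2}\ln(N/2\pi)$ obtained from the Gaussian approximation around the minimum.

```python
import math


def neg_log_partition(energy, n, lo=-3.0, hi=3.0, grid=60000):
    """-ln of the integral of exp(-n * energy(theta)) over [lo, hi] (midpoint rule)."""
    step = (hi - lo) / grid
    values = [-n * energy(lo + (j + 0.5) * step) for j in range(grid)]
    shift = max(values)
    return -(shift + math.log(sum(math.exp(v - shift) for v in values) * step))


def energy(theta):
    # Toy energy with minimum value 0.3 at theta = 0 and unit curvature there.
    return 0.3 + 0.5 * theta ** 2 + theta ** 4


n = 200
exact = neg_log_partition(energy, n)
energy_term = n * 0.3                             # N times the minimum energy
entropy_term = 0.5 * math.log(n / (2 * math.pi))  # volume factor for d = 1
```

Here `energy_term` alone misses `exact` by more than one nat, while `energy_term + entropy_term` is accurate up to a correction of order $1/N$, mirroring the role of the $(d/2)\ln N$ and volume terms in the razor.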



6.4 The Natural Parametrization of A Model 

The evaluation of the razor and the relationship to the asymptotics of Bayes Rule
suggest how to pick the "natural" parametrization of a model. In geometric terms, the
"natural" coordinates describing a surface in the neighbourhood of a given point make
the metric locally flat. The corresponding statement for the manifolds in question here
is that the natural parametrization of a model in the vicinity of $\Theta_0$ reduces the Fisher
Information $J_{ij}$ at $\Theta_0$ to the identity matrix. This choice can also be justified from
the point of view of statistics by noting that for a wide class of parametric families the
maximum likelihood estimator of $\Theta_0$ is asymptotically distributed as a normal density
with covariance matrix proportional to the inverse of $J_{ij}$. If $J_{ij}$ is the identity in
some parametrization, then the
various components of the maximum likelihood estimator are independent, identically
distributed random variables. Therefore, the geometric intuitions for "naturalness"
are in accord with the statistical intuitions. In our context where the true density
need not be a member of the family in question, there is another natural choice
in the vicinity of $\Theta^*$ that minimizes $D(t\|\Theta)$. We could also pick coordinates in
which $\tilde{J}_{\mu\nu} = \nabla_\mu \nabla_\nu D(t\|\Theta)|_{\Theta^*}$ is reduced to the identity matrix. We have carried out
an expansion of the Bayesian posterior probability in terms of $\hat{\Theta}$ which maximizes
$\ln \Pr(E|\Theta)$. We expect that $\hat{\Theta}$ is asymptotically distributed as a normal density with
covariance determined by $\tilde{J}$. The second choice of coordinates will therefore make the components
of $\hat{\Theta}$ independent and identically distributed.
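
A one-dimensional example shows what such a reparametrization looks like. The sketch below assumes the Bernoulli family, whose Fisher Information in the usual parameter is $J(\theta) = 1/(\theta(1-\theta))$, and uses the change of variable $\theta = \sin^2(\phi/2)$; the function names and the finite-difference step are assumptions made for the illustration. Under a change of coordinates the metric transforms as $J_\phi = J_\theta\,(d\theta/d\phi)^2$.

```python
import math


def fisher_theta(theta):
    """Fisher information of Bernoulli(theta) in the usual parametrization."""
    return 1.0 / (theta * (1.0 - theta))


def theta_of_phi(phi):
    """Reparametrization theta = sin^2(phi / 2), with phi in (0, pi)."""
    return math.sin(0.5 * phi) ** 2


def fisher_phi(phi, eps=1e-6):
    """Fisher information in the phi coordinate, using a central difference."""
    dtheta = (theta_of_phi(phi + eps) - theta_of_phi(phi - eps)) / (2 * eps)
    return fisher_theta(theta_of_phi(phi)) * dtheta ** 2


values = [fisher_phi(phi) for phi in (0.3, 1.0, 1.5708, 2.5)]
```

Every entry of `values` is 1 up to numerical error. In one dimension the metric can be made the identity everywhere; in higher dimensions this is generally possible only at a chosen point, which is the local statement made above.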



6.5 Minimum Complexity Density Estimation 



There are numerous close relationships between the work described in this paper and 
previous results on minimum complexity density estimation. The seminal work of 
Barron and Cover introduced the notion of an "index of resolvability" which was 
shown to bound convergence rates of a very general class of minimum complexity
density estimators. This class of estimators was constructed by considering densities 
which achieve the following minimization: 



$$\min_q \left( L(q) + \log \frac{1}{\prod_{i=1}^N q(X_i)} \right) \tag{57}$$



where the $X_i$ are drawn iid from some distribution, $q$ belongs to some countable
list of densities, and the set of $L(q)$ satisfy Kraft's inequality. Equation 57 can be
interpreted as minimizing a two stage code for the density $q$ and the data. The "index
of resolvability" $R_N(p)$ of $p$ is constructed from the expectation value in $p$ of Equation 57
divided by $N$, the number of samples:

$$R_N(p) = \min_q \left( \frac{L_N(q)}{N} + D(p\|q) \right) \tag{58}$$

where the $L_N$ are description lengths of the densities and $D$ is the relative entropy.
This quantity was shown to bound the rates of convergence of the minimum complexity
estimators. In a sense the density achieving the minimization in Equation 58 is
a theoretical analog of the sample-based minimum complexity estimator arising from
Equation 57.
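
The two stage code of Equation 57 can be illustrated with a small countable list of densities. The sketch below is an assumption-laden toy, not Barron and Cover's construction: the list consists of Bernoulli densities with $\theta = j/2^m$ for odd $j$, each assigned a code length of $2m - 1$ bits, which satisfies Kraft's inequality because level $m$ holds $2^{m-1}$ densities. The function name `two_stage_code` and the cutoff `max_level` are hypothetical.

```python
import math


def two_stage_code(ones, n, max_level=6):
    """Minimize L(q) + log2(1 / prod q(X_i)) over a list of Bernoulli densities.

    The list contains theta = j / 2**m for odd j and m = 1..max_level, each
    with code length L = 2m - 1 bits. The Kraft sum is
    sum_m 2**(m-1) * 2**-(2m-1) = sum_m 2**-m <= 1.
    Returns (total code length in bits, chosen theta).
    """
    best = None
    for m in range(1, max_level + 1):
        for j in range(1, 2 ** m, 2):
            theta = j / 2 ** m
            data_bits = -(ones * math.log2(theta)
                          + (n - ones) * math.log2(1 - theta))
            total = (2 * m - 1) + data_bits
            if best is None or total < best[0]:
                best = (total, theta)
    return best
```

With 7 ones in 10 outcomes the cheapest description uses the coarse density $\theta = 1/2$, while with 70 ones in 100 outcomes the minimization moves to $\theta = 3/4$: the finer density is worth its longer description only once there is enough data.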

The work of Barron and Cover starts from the assumption that description length 
is the correct measure of complexity in the context of density estimation and that 
minimizing this complexity is a good idea. They have demonstrated several very 
general and beautiful results concerning the consistency of the minimum description 
length principle in the context of density estimation. We also know that minimum de- 
scription length principles lead to asymptotically optimal data compression schemes. 
The goal of this paper has been to develop some alternative intuitions for the practical
meaning of simplicity and complexity in terms of geometry in the space of
distributions. The razor defined in this paper, like the index of resolvability, is an idealized
theoretical quantity which sample-based inference schemes will try to approximate.
The razor reflects the typical order-by-order behaviour of Bayes Rule just as the index
of resolvability reflects the expected behaviour of the minimum complexity criterion
of Barron and Cover.

In order to compare the two quantities and their consequences we have to note 
that Barron and Cover do not work with families of distributions, but rather with a 
collection of densities. Consequently, in order to carry out inference with a parametric 
family they must begin by discretizing the parameter manifold. The goal of this paper 
has been to develop a measure of the simplicity of a family as a whole and hence we 
do not carry out such a truncation of a parameter manifold. Under the assumption 
that the true density is approximated by the parametric family Barron and Cover 
find that an optimal discretization of the parameter manifold yields
a bound on the resolvability of a parametric model of:

$$R_n(p_\theta) \leq \frac{1}{n}\left( \frac{d}{2}\log n + \log\frac{\sqrt{\det J(\theta)}}{w(\theta)} - \frac{d}{2}\log\frac{c_d}{e} + o(1) \right) \tag{59}$$

where $\theta$ is the true parameter value, $J(\theta)$ is the Fisher Information at $\theta$, $w(\theta)$ is a prior
density on the parameter manifold and $c_d$ arises from sphere-packing problems and
is close to $2\pi e$ for large $d$. The asymptotic minimax value of the bound is attained
by choosing the prior $w(\theta)$ to be Jeffreys prior $\sqrt{\det J}/\int d\mu(\Theta)\sqrt{\det J}$. (This is one
of the coding theoretic justifications for the choice of Jeffreys prior. We have provided
a novel discussion of the choice of that prior in this paper.) We see
that aside from the factor of $1/n$ the leading terms reproduce the logarithm of the
razor for the case when the true density is infinitesimally distant from the parameter
manifold in relative entropy sense so that $D(t\|\Theta^*) = 0$. Barron and Cover use this
bound to evaluate convergence rates of minimum complexity estimators. In contrast,
this paper has presented the leading terms in an asymptotically exact expansion of
the razor as an abstract measure of the complexity of a parametric model relative
to a true distribution. I begin by studying what the meaning of "simplicity" should
be in the context of Bayes Rule and the geometry of the space of distributions, and
arrive at results that are closely related to the minimum complexity scheme. Saying
that models with larger razors are preferred is asymptotically equivalent to saying
that models with a lower resolvability (given an optimal discretization) are preferred.
However, we see from a comparison of the logarithm of the razor and Equation 59 that
the resolvability bound is a truncation of the series expansion of the log razor, which
therefore gives a finer classification of model families. The geometric formulation
of this paper leads to interpretations of the various terms in the razor that give
an alternative understanding of the terms in the index of resolvability that govern
the rate of convergence of minimum complexity estimators. We have also given a
systematic scheme for evaluating the razor to all orders in $1/N$. This suggests that
the results on optimal discretizations of parameter manifolds used in the index of
resolvability should be extended to include such subleading terms.
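
The correspondence between the resolvability bound and the razor can be checked term by term. The lines below are a worked comparison under stated assumptions: the bound of Equation 59 is taken in the form $n R_n(p_\theta) \leq \frac{d}{2}\log n + \log(\sqrt{\det J(\theta)}/w(\theta)) - \frac{d}{2}\log(c_d/e) + o(1)$, the prior is Jeffreys prior with normalization $V = \int d\mu(\Theta)\sqrt{\det J}$, $c_d$ is replaced by its large-$d$ value $2\pi e$, and the true density lies in the family so that $D(t\|\Theta^*) = 0$ and $\tilde{J} = J$.

```latex
\begin{aligned}
w(\theta) = \frac{\sqrt{\det J(\theta)}}{V}
  \quad &\Rightarrow \quad \log\frac{\sqrt{\det J(\theta)}}{w(\theta)} = \log V, \\
n\, R_n(p_\theta) &\leq \frac{d}{2}\log n + \log V - \frac{d}{2}\log\frac{c_d}{e} + o(1)
  \;\approx\; \frac{d}{2}\log n - \log\frac{(2\pi)^{d/2}}{V}, \\
\chi_N(A)\Big|_{D(t\|\Theta^*) = 0,\; \tilde{J} = J}
  &= \frac{d}{2}\ln N - \ln\frac{(2\pi)^{d/2}}{V} + O(1/N).
\end{aligned}
```

The two right hand sides agree through $O(1)$; the razor expansion then continues with the $O(1/N)$ terms that the bound absorbs into $o(1)$.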

7 Conclusion 

In this paper we have set out to develop a measure of complexity of a parametric 
distribution as a description of a particular true distribution. We avoided appealing 
to the minimum description length principle or to results in coding theory in order to 
arrive at a more geometric understanding in terms of the embedding of the parametric 
model in the space of probability distributions. We constructed an index of complex- 
ity called the razor of a model whose asymptotic expansion was shown to reflect the 
accuracy and the simplicity of the model as a description of a given true distribu- 
tion. The terms in the asymptotic expansion were given geometrical interpretations 
in terms of distances and volumes in the space of distributions. These distances and 
volumes were computed in a metric and measure given by the Fisher Information on 
the model manifold and the square root of its determinant. This metric and measure 
were justified from a statistical and geometrical point of view by demonstrating that 
in a certain sense a uniform prior in the space of distributions would induce a Fisher 
Information (or Jeffreys) prior on a parameter manifold. More exactly, we assumed 
that indistinguishable distributions should not be counted separately in an integral 
over the model manifold and that there is a "translation invariance" in the space of 
distributions. We then showed that a Jeffreys prior can be rigorously constructed as 
the continuum limit of a sequence of discrete priors consistent with these assumptions. 
A technique of integration common in statistical physics was introduced to facilitate 
the asymptotic analysis of the razor and it was also used to analyze the asymptotics 
of the logarithm of the Bayesian posterior. We have found that the razor defined in 
this paper reflects the typical order-by-order asymptotics of the Bayesian posterior 
probability just as the index of resolvability of Barron and Cover reflects the expected
asymptotics of the minimum complexity criterion studied by those authors. In
particular, any finite series of terms in the asymptotic expansion of the logarithm of the
Bayesian posterior converges in probability to the corresponding series of terms in the 
asymptotic expansion of the razor. Examination of the logarithm of the Bayesian pos- 
terior and its relationship to the razor also suggested certain subleading geometrical 
corrections to the expected asymptotics of Bayes Rule and corresponding corrections 
to stochastic complexity defined by Rissanen.